Programme
All time slots below are in Eastern Standard Time (UTC-5). The programme is still tentative!
Morning programme
09:00-09:15 AM — Opening remarks
09:15-10:00 AM — Keynote 1, by Pascale Fung
10:00-10:30 AM — Oral presentations
- Is artificial intelligence still intelligence? LLMs generalize to novel adjective-noun pairs, but don’t mimic the full human distribution
  Hayley Ross, Kathryn Davidson, Najoung Kim
- Investigating the Generalizability of Pretrained Language Models across Multiple Dimensions: A Case Study of NLI and MRC
  Ritam Dutt, Sagnik Ray Choudhury, Varun Venkat Rao, Carolyn Rose, V.G. Vinod Vydiswaran
10:30-11:00 AM — Coffee break
11:00-11:45 AM — Keynote 2, by Najoung Kim
11:45 AM-12:30 PM — Spotlight presentations
- MLissard: Multilingual Long and Simple Sequential Reasoning Benchmarks
  Mirelle Candida Bueno, Roberto Lotufo, Rodrigo Frassetto Nogueira
- OmniDialog: A Multimodal Benchmark for Generalization Across Text, Visual, and Audio Modalities
  Anton Razzhigaev, Maxim Kurkin, Elizaveta Goncharova, Irina Abdullaeva, Anastasia Lysenko, Alexander Panchenko, Andrey Kuznetsov, Denis Dimitrov
- MultiPragEval: Multilingual Pragmatic Evaluation of Large Language Models
  Dojun Park, Jiwoo Lee, Seohyun Park, Hyeyun Jeong, Youngeun Koo, Soonha Hwang, Seonwoo Park, Sungeun Lee
- The SlayQA benchmark of social reasoning: testing gender-inclusive generalization with neopronouns
  Bastian Bunzeck, Sina Zarrieß
- MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models
  Wentian Wang, Sarthak Jain, Paul Kantor, Jacob Feldman, Lazaros Gallos, Hao Wang
12:30-1:30 PM — Lunch break
Afternoon programme
1:30-2:45 PM — Poster session
- [GenBench] MLissard: Multilingual Long and Simple Sequential Reasoning Benchmarks
  Mirelle Candida Bueno, Roberto Lotufo, Rodrigo Frassetto Nogueira
- [GenBench] OmniDialog: A Multimodal Benchmark for Generalization Across Text, Visual, and Audio Modalities
  Anton Razzhigaev, Maxim Kurkin, Elizaveta Goncharova, Irina Abdullaeva, Anastasia Lysenko, Alexander Panchenko, Andrey Kuznetsov, Denis Dimitrov
- [GenBench] MultiPragEval: Multilingual Pragmatic Evaluation of Large Language Models
  Dojun Park, Jiwoo Lee, Seohyun Park, Hyeyun Jeong, Youngeun Koo, Soonha Hwang, Seonwoo Park, Sungeun Lee
- [GenBench] From Language to Pixels: Task Recognition and Task Learning in LLMs
  Janek Falkenstein, Carolin M. Schuster, Alexander H Berger, Georg Groh
- [GenBench] Automated test generation to evaluate tool-augmented LLMs as conversational AI agents
  Samuel Arcadinho, David Oliveira Aparicio, Mariana S. C. Almeida
- [GenBench] Beyond the Numbers: Transparency in Relation Extraction Benchmark Creation and Leaderboards
  Varvara Arzt, Allan Hanbury
- [GenBench] CHIE: Generative MRC Evaluation for in-context QA with Correctness, Helpfulness, Irrelevancy, and Extraneousness Aspects
  Wannaphong Phatthiyaphaibun, Surapon Nonesung, Peerat Limkonchotiwat, Can Udomcharoenchaikit, Jitkapat Sawatphol, Ekapol Chuangsuwanich, Sarana Nutanong
- [GenBench] Evaluating the fairness of task-adaptive pretraining on unlabeled test data before few-shot text classification
  Kush Dubey
- [GenBench] Towards a new Benchmark for Emotion Detection in NLP: A Unifying Framework of Recent Corpora
  Anna Koufakou, Elijah Nieves, John Peller
- [GenBench CBT] The SlayQA benchmark of social reasoning: testing gender-inclusive generalization with neopronouns
  Bastian Bunzeck, Sina Zarrieß
- [GenBench CBT] MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models
  Wentian Wang, Sarthak Jain, Paul Kantor, Jacob Feldman, Lazaros Gallos, Hao Wang
- [GenBench Non-archival] A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners
  Bowen Jiang, Yangxinyu Xie, Zhuoqun Hao, Xiaomeng Wang, Tanwi Mallick, Weijie J Su, Camillo Jose Taylor, Dan Roth
- [GenBench Non-archival] Cross-Domain Question Generation: A Comparative Study
  Niloufar Beyranvand, Aijun An, Heidar Davoudi
- [GenBench Non-archival] The Relationship Between Compositional Generalization and Misinformation in Emergent Communication
  Heeyoung Lee
- [GenBench Non-archival] NeSyCoCo: A Neuro-Symbolic Concept Composer for Compositional Generalization
  Danial Kamali, Elham Barezi, Parisa Kordjamshidi
- [GenBench Non-archival] Benchmarking Foundation Models on Exceptional Cases: Dataset Creation and Validation
  Suho Kang, Jungyang Park, Joonseo Ha, SoMin Kim, JinHyeong Kim, Subeen Park, Kyungwoo Song
- [GenBench Non-archival] LUCY: Linking Uncertainty and ConsistencY of Large Language Models for Question Answering
  Urja Khurana, Lea Krause
- [GenBench Non-archival] Can General-Purpose Large Language Models Generalize to English-Thai Machine Translation?
  Jirat Chiaranaipanich, Naiyarat Hanmatheekuna, Jitkapat Sawatphol, Krittamate Tiankanon, Jiramet Kinchagawat, Amrest Chinkamol, Parinthapat Pengpun, Piyalitt Ittichaiwong, Peerat Limkonchotiwat
- [GenBench Non-archival] Towards Dynamic and Realistic Evaluation of Multi-modal Large Language Model
  Huiqi Zou, Yijiang Li, Ziang Xiao
- [GenBench Non-archival] Leveraging Isomorphisms to facilitate Zero-Shot KBQA Generalization
  Ritam Dutt, Dongfang Ling, Yu Gu, Carolyn Rose
- [Findings] Measuring the Robustness of NLP Models to Domain Shifts
  Nitay Calderon, Naveh Porat, Eyal Ben-David, Alexander Chapanin, Zorik Gekhman, Nadav Oved, Vitaly Shalumov, Roi Reichart