Programme

Note that all time slots listed below are in Eastern Standard Time (UTC-5). The programme is tentative and subject to change!

Morning programme

09:00-09:15 AM — Opening remarks

09:15-10:00 AM — Keynote 1, by Pascale Fung

10:00-10:30 AM — Oral presentations

  • 10:00-10:15 AM — Is artificial intelligence still intelligence? LLMs generalize to novel adjective-noun pairs, but don’t mimic the full human distribution
    Hayley Ross, Kathryn Davidson, Najoung Kim

  • 10:15-10:30 AM — Investigating the Generalizability of Pretrained Language Models across Multiple Dimensions: A Case Study of NLI and MRC
    Ritam Dutt, Sagnik Ray Choudhury, Varun Venkat Rao, Carolyn Rose, V.G. Vinod Vydiswaran

10:30-11:00 AM — Coffee break

11:00-11:45 AM — Keynote 2, by Najoung Kim

11:45 AM-12:30 PM — Spotlight presentations

  • MLissard: Multilingual Long and Simple Sequential Reasoning Benchmarks
    Mirelle Candida Bueno, Roberto Lotufo, Rodrigo Frassetto Nogueira

  • OmniDialog: A Multimodal Benchmark for Generalization Across Text, Visual, and Audio Modalities
    Anton Razzhigaev, Maxim Kurkin, Elizaveta Goncharova, Irina Abdullaeva, Anastasia Lysenko, Alexander Panchenko, Andrey Kuznetsov, Denis Dimitrov

  • MultiPragEval: Multilingual Pragmatic Evaluation of Large Language Models
    Dojun Park, Jiwoo Lee, Seohyun Park, Hyeyun Jeong, Youngeun Koo, Soonha Hwang, Seonwoo Park, Sungeun Lee

  • The SlayQA benchmark of social reasoning: testing gender-inclusive generalization with neopronouns
    Bastian Bunzeck, Sina Zarrieß

  • MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models
    Wentian Wang, Sarthak Jain, Paul Kantor, Jacob Feldman, Lazaros Gallos, Hao Wang

12:30-1:30 PM — Lunch break

Afternoon programme

1:30-2:45 PM — Poster session

An overview of the posters that will be presented in this session:
  • [GenBench] MLissard: Multilingual Long and Simple Sequential Reasoning Benchmarks
    Mirelle Candida Bueno, Roberto Lotufo, Rodrigo Frassetto Nogueira
  • [GenBench] OmniDialog: A Multimodal Benchmark for Generalization Across Text, Visual, and Audio Modalities
    Anton Razzhigaev, Maxim Kurkin, Elizaveta Goncharova, Irina Abdullaeva, Anastasia Lysenko, Alexander Panchenko, Andrey Kuznetsov, Denis Dimitrov
  • [GenBench] MultiPragEval: Multilingual Pragmatic Evaluation of Large Language Models
    Dojun Park, Jiwoo Lee, Seohyun Park, Hyeyun Jeong, Youngeun Koo, Soonha Hwang, Seonwoo Park, Sungeun Lee
  • [GenBench] From Language to Pixels: Task Recognition and Task Learning in LLMs
    Janek Falkenstein, Carolin M. Schuster, Alexander H Berger, Georg Groh
  • [GenBench] Automated test generation to evaluate tool-augmented LLMs as conversational AI agents
    Samuel Arcadinho, David Oliveira Aparicio, Mariana S. C. Almeida
  • [GenBench] Beyond the Numbers: Transparency in Relation Extraction Benchmark Creation and Leaderboards
    Varvara Arzt, Allan Hanbury
  • [GenBench] CHIE: Generative MRC Evaluation for in-context QA with Correctness, Helpfulness, Irrelevancy, and Extraneousness Aspects
    Wannaphong Phatthiyaphaibun, Surapon Nonesung, Peerat Limkonchotiwat, Can Udomcharoenchaikit, Jitkapat Sawatphol, Ekapol Chuangsuwanich, Sarana Nutanong
  • [GenBench] Evaluating the fairness of task-adaptive pretraining on unlabeled test data before few-shot text classification
    Kush Dubey
  • [GenBench] Towards a new Benchmark for Emotion Detection in NLP: A Unifying Framework of Recent Corpora
    Anna Koufakou, Elijah Nieves, John Peller
  • [GenBench CBT] The SlayQA benchmark of social reasoning: testing gender-inclusive generalization with neopronouns
    Bastian Bunzeck, Sina Zarrieß
  • [GenBench CBT] MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models
    Wentian Wang, Sarthak Jain, Paul Kantor, Jacob Feldman, Lazaros Gallos, Hao Wang
  • [GenBench Non-archival] A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners
    Bowen Jiang, Yangxinyu Xie, Zhuoqun Hao, Xiaomeng Wang, Tanwi Mallick, Weijie J Su, Camillo Jose Taylor, Dan Roth
  • [GenBench Non-archival] Cross-Domain Question Generation: A Comparative Study
    Niloufar Beyranvand, Aijun An, Heidar Davoudi
  • [GenBench Non-archival] The Relationship Between Compositional Generalization and Misinformation in Emergent Communication
    Heeyoung Lee
  • [GenBench Non-archival] NeSyCoCo: A Neuro-Symbolic Concept Composer for Compositional Generalization
    Danial Kamali, Elham Barezi, Parisa Kordjamshidi
  • [GenBench Non-archival] Benchmarking Foundation Models on Exceptional Cases: Dataset Creation and Validation
    Suho Kang, Jungyang Park, Joonseo Ha, SoMin Kim, JinHyeong Kim, Subeen Park, Kyungwoo Song
  • [GenBench Non-archival] LUCY: Linking Uncertainty and ConsistencY of Large Language Models for Question Answering
    Urja Khurana, Lea Krause
  • [GenBench Non-archival] Can General-Purpose Large Language Models Generalize to English-Thai Machine Translation?
    Jirat Chiaranaipanich, Naiyarat Hanmatheekuna, Jitkapat Sawatphol, Krittamate Tiankanon, Jiramet Kinchagawat, Amrest Chinkamol, Parinthapat Pengpun, Piyalitt Ittichaiwong, Peerat Limkonchotiwat
  • [GenBench Non-archival] Towards Dynamic and Realistic Evaluation of Multi-modal Large Language Model
    Huiqi Zou, Yijiang Li, Ziang Xiao
  • [GenBench Non-archival] Leveraging Isomorphisms to facilitate Zero-Shot KBQA Generalization
    Ritam Dutt, Dongfang Ling, Yu Gu, Carolyn Rose
  • [Findings] Measuring the Robustness of NLP Models to Domain Shifts
    Nitay Calderon, Naveh Porat, Eyal Ben-David, Alexander Chapanin, Zorik Gekhman, Nadav Oved, Vitaly Shalumov, Roi Reichart

2:45-3:30 PM — Keynote 3, by Sameer Singh

3:30-4:00 PM — Coffee break

4:00-4:30 PM — Panel

4:30-4:45 PM — Closing remarks and best paper award