Programme
Note that all time slots below are in Eastern Standard Time (UTC-5) and that all sessions take place in the Brickell room. The programme is still tentative.
Morning programme
09:00-09:15 AM — Opening remarks
09:15-10:00 AM — Keynote 1, by Najoung Kim
Title: Semantic generalizations in humans and machines
Abstract: Do machines “understand”? Empirically addressing this question requires an operationalization of what it means to understand. In this talk, I will discuss three tests for machines grounded in formal semantic theories that characterize various aspects of human linguistic understanding, examining the capacities to assign adequate meaning representations to linguistic expressions, to track entities and their states in discourse, and to draw adequate inferences from complex expressions. Critically, these capacities must generalize to unseen expressions. I will discuss the findings from these studies contextualized with respect to human capacities.
Bio: Najoung Kim is an Assistant Professor at the Department of Linguistics and an affiliate faculty in the Department of Computer Science at Boston University. She is also currently a visiting faculty researcher at Google DeepMind. Before joining BU, she was a Faculty Fellow at the Center for Data Science at New York University and received her PhD in Cognitive Science at Johns Hopkins University. She is interested in studying meaning in both human and machine learners, especially ways in which they generalize to novel inputs and ways in which they treat implicit meaning. Her research has been supported by NSF and Google, and has received awards at venues such as ACL and *SEM.
10:00-10:30 AM — Oral presentations
- 10:00-10:15 AM — Is artificial intelligence still intelligence? LLMs generalize to novel adjective-noun pairs, but don’t mimic the full human distribution
  Hayley Ross (presenter), Kathryn Davidson, Najoung Kim
- 10:15-10:30 AM — Investigating the Generalizability of Pretrained Language Models across Multiple Dimensions: A Case Study of NLI and MRC
  Ritam Dutt, Sagnik Ray Choudhury (presenter), Varun Venkat Rao, Carolyn Rose, V.G. Vinod Vydiswaran
10:30-11:00 AM — Coffee break
11:00-11:45 AM — Keynote 2, by Kyle Lo
Bio: Kyle Lo is a research scientist at the Allen Institute for AI in Seattle, co-leading the pretraining data team for OLMo. His research focuses on open language models, domain adaptation and specialization, evaluation methods, and human-AI interaction. His award-winning work has appeared at major conferences such as ACL, EMNLP, and CHI, and has been featured in Nature, Science, TechCrunch, and other outlets. In 2020, he co-led a White House OSTP initiative to publicly release the largest collection of COVID-19 research for computing use. Kyle holds a Statistics degree from the University of Washington and enjoys board games, boba tea, D&D, and his cat Belphegor.
11:45 AM-12:30 PM — Spotlight presentations
- MLissard: Multilingual Long and Simple Sequential Reasoning Benchmarks
  Mirelle Candida Bueno (presenter), Roberto Lotufo, Rodrigo Frassetto Nogueira
- OmniDialog: A Multimodal Benchmark for Generalization Across Text, Visual, and Audio Modalities
  Anton Razzhigaev, Maxim Kurkin (presenter), Elizaveta Goncharova, Irina Abdullaeva, Anastasia Lysenko, Alexander Panchenko, Andrey Kuznetsov, Denis Dimitrov
- MultiPragEval: Multilingual Pragmatic Evaluation of Large Language Models
  Dojun Park, Jiwoo Lee (presenter), Seohyun Park, Hyeyun Jeong, Youngeun Koo, Soonha Hwang, Seonwoo Park, Sungeun Lee
- The SlayQA benchmark of social reasoning: testing gender-inclusive generalization with neopronouns
  Bastian Bunzeck (presenter), Sina Zarrieß
- MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models
  Wentian Wang, Sarthak Jain, Paul Kantor, Jacob Feldman, Lazaros Gallos, Hao Wang (presented by Hengyi Wang)
12:30-1:45 PM — Lunch break
Afternoon programme
1:45-3:00 PM — Poster session
- MLissard: Multilingual Long and Simple Sequential Reasoning Benchmarks (GenBench)
  Mirelle Candida Bueno, Roberto Lotufo, Rodrigo Frassetto Nogueira
- OmniDialog: A Multimodal Benchmark for Generalization Across Text, Visual, and Audio Modalities (GenBench)
  Anton Razzhigaev, Maxim Kurkin, Elizaveta Goncharova, Irina Abdullaeva, Anastasia Lysenko, Alexander Panchenko, Andrey Kuznetsov, Denis Dimitrov
- MultiPragEval: Multilingual Pragmatic Evaluation of Large Language Models (GenBench)
  Dojun Park, Jiwoo Lee, Seohyun Park, Hyeyun Jeong, Youngeun Koo, Soonha Hwang, Seonwoo Park, Sungeun Lee
- From Language to Pixels: Task Recognition and Task Learning in LLMs (GenBench)
  Janek Falkenstein, Carolin M. Schuster, Alexander H Berger, Georg Groh
- Automated test generation to evaluate tool-augmented LLMs as conversational AI agents (GenBench)
  Samuel Arcadinho, David Oliveira Aparicio, Mariana S. C. Almeida
- Beyond the Numbers: Transparency in Relation Extraction Benchmark Creation and Leaderboards (GenBench)
  Varvara Arzt, Allan Hanbury
- CHIE: Generative MRC Evaluation for in-context QA with Correctness, Helpfulness, Irrelevancy, and Extraneousness Aspects (GenBench)
  Wannaphong Phatthiyaphaibun, Surapon Nonesung, Peerat Limkonchotiwat, Can Udomcharoenchaikit, Jitkapat Sawatphol, Ekapol Chuangsuwanich, Sarana Nutanong
- Evaluating the fairness of task-adaptive pretraining on unlabeled test data before few-shot text classification (GenBench)
  Kush Dubey
- Towards a new Benchmark for Emotion Detection in NLP: A Unifying Framework of Recent Corpora (GenBench)
  Anna Koufakou, Elijah Nieves, John Peller
- The SlayQA benchmark of social reasoning: testing gender-inclusive generalization with neopronouns (GenBench CBT)
  Bastian Bunzeck, Sina Zarrieß
- MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models (GenBench CBT)
  Wentian Wang, Sarthak Jain, Paul Kantor, Jacob Feldman, Lazaros Gallos, Hao Wang (presented by Hengyi Wang)
- A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners (GenBench, non-archival)
  Bowen Jiang, Yangxinyu Xie, Zhuoqun Hao, Xiaomeng Wang, Tanwi Mallick, Weijie J Su, Camillo Jose Taylor, Dan Roth
- Cross-Domain Question Generation: A Comparative Study (GenBench, non-archival)
  Niloufar Beyranvand, Aijun An, Heidar Davoudi
- The Relationship Between Compositional Generalization and Misinformation in Emergent Communication (GenBench, non-archival)
  Heeyoung Lee
- NeSyCoCo: A Neuro-Symbolic Concept Composer for Compositional Generalization (GenBench, non-archival)
  Danial Kamali, Elham Barezi, Parisa Kordjamshidi
- Benchmarking Foundation Models on Exceptional Cases: Dataset Creation and Validation (GenBench, non-archival)
  Suho Kang, Jungyang Park, Joonseo Ha, SoMin Kim, JinHyeong Kim, Subeen Park, Kyungwoo Song
- LUCY: Linking Uncertainty and ConsistencY of Large Language Models for Question Answering (GenBench, non-archival)
  Urja Khurana, Lea Krause
- Can General-Purpose Large Language Models Generalize to English-Thai Machine Translation? (GenBench, non-archival)
  Jirat Chiaranaipanich, Naiyarat Hanmatheekuna, Jitkapat Sawatphol, Krittamate Tiankanon, Jiramet Kinchagawat, Amrest Chinkamol, Parinthapat Pengpun, Piyalitt Ittichaiwong, Peerat Limkonchotiwat
- Towards Dynamic and Realistic Evaluation of Multi-modal Large Language Model (GenBench, non-archival)
  Huiqi Zou, Yijiang Li, Ziang Xiao
- Leveraging Isomorphisms to facilitate Zero-Shot KBQA Generalization (GenBench, non-archival)
  Ritam Dutt, Dongfang Ling, Yu Gu, Carolyn Rose
- Measuring the Robustness of NLP Models to Domain Shifts (Findings)
  Nitay Calderon, Naveh Porat, Eyal Ben-David, Alexander Chapanin, Zorik Gekhman, Nadav Oved, Vitaly Shalumov, Roi Reichart
- Reconfidencing LLM Uncertainty from the Grouping Loss Perspective (Findings)
  Lihu Chen, Alexandre Perez-Lebel, Fabian M. Suchanek, Gaël Varoquaux
3:00-3:45 PM — Keynote 3, by Sameer Singh
Bio: Dr. Sameer Singh is a Professor of Computer Science at the University of California, Irvine (UCI) and a cofounder/CTO of Spiffy AI. He works primarily on the evaluation, robustness, and interpretability of machine learning algorithms and large models that reason with text and structure for natural language processing. He has been named a Kavli Fellow by the National Academy of Sciences, received the NSF CAREER award, the UCI Distinguished Early Career Faculty award, and the Hellman Faculty Fellowship, and was selected as a DARPA Riser. His group has received funding from the Allen Institute for AI, Amazon, NSF, DARPA, Adobe Research, the Hasso Plattner Institute, NEC, Base 11, and FICO. Sameer has published extensively at machine learning and natural language processing venues and has received numerous paper awards, including at KDD 2016, ACL 2018, EMNLP 2019, AKBC 2020, ACL 2020, and NAACL 2022.