Programme
Note that all time slots listed below are in Singapore Standard Time (GMT+8), and that all activities take place in Central Ballroom 3.
Morning programme
09:00-09:15 — Opening remarks
09:15-10:00 — Keynote 1, by Anna Rogers
Title: A sanity check on emergent properties
Abstract: One of the frequent points in the mainstream narrative about large language models is that they have “emergent properties” (sometimes even considered dangerous enough to pose an existential risk to mankind). However, there is much disagreement about even the very definition of such properties. If they are understood as a kind of generalization beyond training data, i.e. as something that a model does without being explicitly trained for it, I argue that we have not in fact established the existence of any such properties, and at the moment we do not even have the methodology for doing so.
10:00-11:15 — Poster session 1
Posters presented in this session:
- [GenBench] Temporal Generalizability in Multimodal Misinformation Detection
  Nataliya Stepanova and Björn Ross
- [GenBench] Robust Generalization Strategies for Morpheme Glossing in an Endangered Language Documentation Context
  Michael Ginn and Alexis Palmer
- [GenBench] Syntax-Guided Transformers: Elevating Compositional Generalization and Grounding in Multimodal Environments
  Danial Kamali and Parisa Kordjamshidi
- [GenBench] Inductive Bias Is in the Eye of the Beholder
  Michael Wilson and Robert Frank
- [GenBench CBT] On using distribution-based compositionality assessment to evaluate compositional generalisation in machine translation
  Anssi Moisio, Mathias Creutz, and Mikko Kurimo
- [GenBench, Non-archival] The ICL consistency test
  Lucas Weber, Elia Bruni, and Dieuwke Hupkes
- [GenBench, Non-archival] Generalizability and Robustness of Large Language Models Detecting Alzheimer’s Disease from Speech
  Jekaterina Novikova
- [GenBench] Cross-Lingual Consistency of Factual Knowledge in Multilingual Language Models
  Jirui Qi, Raquel Fernández, and Arianna Bisazza
- [GenBench, Non-archival] The Validity of Evaluation Results: Assessing Concurrence Across Compositionality Benchmarks
  Kaiser Sun, Adina Williams, and Dieuwke Hupkes
- [GenBench] Walking a Tightrope – Evaluating Large Language Models in High-Risk Domains
  Chia-Chien Hung, Wiem Ben Rim, Lindsay Frost, Lars Bruckner, and Carolin Lawrence
- [Other] Subword Segmental Machine Translation: Unifying Segmentation and Target Sentence Generation
  Francois Meyer and Jan Buys
- [Findings] The language of prompting: What linguistic properties make a prompt successful?
  Alina Leidinger, Robert Van Rooij, and Ekaterina Shutova
- [Findings] IRFL: Image Recognition of Figurative Language
  Ron Yosef, Yonatan Bitton, and Dafna Shahaf
- [Findings] Three Questions Concerning the Use of Large Language Models to Facilitate Mathematics Learning
  An-Zi Yen and Wei-Ling Hsu
- [Findings] mReFinED: An Efficient End-to-End Multilingual Entity Linking System
  Peerat Limkonchotiwat, Weiwei Cheng, Christos Christodoulopoulos, Amir Saffari, and Jens Lehmann
- [Findings] Noisy Self-Training with Synthetic Queries for Dense Retrieval
  Fan Jiang, Tom Drummond, and Trevor Cohn
- [Findings] Viewing Knowledge Transfer in Multilingual Machine Translation Through a Representational Lens
  David Stap, Vlad Niculae, and Christof Monz
- [Findings] Quantifying the Dialect Gap in Large Language Models and its Causes Across Languages
  Anjali Kantharuban, Ivan Vulić, and Anna Korhonen
- [Findings] How Predictable Are Large Language Model Capabilities? A Case Study on BIG-bench
  Qinyuan Ye, Harvey Yiyun Fu, Xiang Ren, and Robin Jia
- [Findings] Harnessing Dataset Cartography for Improved Compositional Generalization in Transformers
  Osman Batur İnce, Tanin Zeraati, Semih Yagcioglu, Yadollah Yaghoobzadeh, Erkut Erdem, and Aykut Erdem
- [Findings] Are Structural Concepts Universal in Transformer Language Models? Towards Interpretable Cross-Lingual Generalization
  Ningyu Xu, Qi Zhang, Jingting Ye, Menghan Zhang, and Xuanjing Huang
- [Findings] Test-Time Self-Adaptive Small Language Models for Question Answering
  Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C. Park
- [Findings] The Less the Merrier? Investigating Language Representation in Multilingual Models
  Hellina Hailu Nigatu, Atnafu Lambebo Tonja, and Jugal Kalita
- [Findings] Test-time Augmentation for Factual Probing
  Go Kamoda, Benjamin Heinzerling, Keisuke Sakaguchi, and Kentaro Inui
10:30-11:00 — Coffee break
11:15-12:00 — Keynote 2, by Adina Williams
Title: Evaluation after the LLM boom: frustrations, fallacies, and the future
12:00-12:30 — CBT spotlights
- 12:00-12:08 — GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding
  Andor Diera, Abdelhalim Dahou, Lukas Galke, Fabian Karl, Florian Sihler, Ansgar Scherp
- 12:08-12:15 — Latent Feature-based Data Splits to Improve Generalisation Evaluation: A Hate Speech Detection Case Study
  Maike Züfle, Verna Dankers, Ivan Titov
- 12:15-12:23 — On using distribution-based compositionality assessment to evaluate compositional generalisation in machine translation
  Anssi Moisio, Mathias Creutz, Mikko Kurimo
- 12:23-12:30 — Cross-Lingual Consistency of Factual Knowledge in Multilingual Language Models
  Jirui Qi, Raquel Fernández, Arianna Bisazza
12:30-14:00 — Lunch break
Afternoon programme
14:00-14:45 — Keynote 3, by Tatsunori Hashimoto
Title: Understanding generalization for instruction following and black-box language models
Abstract: Instruction following language models have shown a remarkable ability to perform a wide range of tasks with little to no additional training data. Do these abilities come from a revolution in pre-training and instruction-following, or are there other more mundane explanations for how these models work? In this talk, I will discuss our efforts to answer these questions by replicating instruction-following models that generalize across tasks, studying the consistency of these models across different task formats, and building tests for benchmark contamination in pretraining.
14:45-15:30 — Oral presentations
- 14:45-15:00 — Evaluating Neural Language Models as Cognitive Models of Language Acquisition
  Hector Javier Vazquez Martinez, Annika Lea Heuser, Charles Yang, Jordan Kodner
- 15:00-15:15 — Understanding Code Semantics: An Evaluation of Transformer Models in Summarization
  Debanjan Mondal, Abhilasha Lodha, Ankita Sahoo, Beena Kumari
- 15:15-15:30 — Cross-Lingual Data Augmentation For Thai Question-Answering
  Parinthapat Pengpun, Can Udomcharoenchaikit, Weerayut Buaphet, Peerat Limkonchotiwat
15:30-16:00 — Coffee break
16:00-17:00 — Poster session 2 (hybrid)
Posters presented in this session:
- [GenBench] 90% F1 Score in Relation Triple Extraction: Is it Real?
  Pratik Saini, Samiran Pal, Tapas Nayak, Indrajit Bhattacharya
- [GenBench CBT] mSCAN: A Dataset for Multilingual Compositional Generalisation Evaluation
  Amélie Reymond, Shane Steinert-Threlkeld
- [GenBench CBT] GQG: Generalized Quantifier Generalization - A Dataset for Evaluating Quantifier Semantics Understanding in Language Models
  Leroy Zhifei Wang, Shane Steinert-Threlkeld
- [GenBench, Non-archival] Fighting Bias with Bias: Promoting Model Robustness by Amplifying Dataset Biases
  Yuval Reif, Roy Schwartz
- [GenBench CBT] GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding
  Andor Diera, Abdelhalim Dahou, Lukas Galke, Fabian Karl, Ansgar Scherp
- [GenBench CBT] Latent Feature-based Data Splits to Improve Generalisation Evaluation: A Hate Speech Detection Case Study
  Maike Züfle, Verna Dankers, Ivan Titov
- [GenBench CBT] Blackbird Language Matrices Tasks for Generalization
  Paola Merlo, Chunyang Jiang, Giuseppe Samo, Vivi Nastase
- [GenBench] In-Context Learning for Text Classification with Many Labels
  Aristides Milios, Siva Reddy, Dzmitry Bahdanau
- [GenBench CBT] Shifted PAUQ: Distribution shift in text-to-SQL
  Oleg Somov, Elena Tutubalina
- [Findings] USB: A Unified Summarization Benchmark Across Tasks and Domains
  Kundan Krishna, Prakhar Gupta, Sanjana Ramprasad, Byron C Wallace, Jeffrey P. Bigham, Zachary Chase Lipton
- [Findings] Effects of Human Adversarial and Affable Samples on BERT Generalizability
  Aparna Elangovan, Jiayuan He, Yuan Li, Karin Verspoor
- [Findings] Generalizing Few-Shot Named Entity Recognizers to Unseen Domains with Type-Related Features
  Zihan Wang, Ziqi Zhao, Zhumin Chen, Pengjie Ren, Maarten de Rijke, Zhaochun Ren
- [Findings] Compositional Generalization for Data-to-Text Generation
  Xinnuo Xu, Ivan Titov, Mirella Lapata
- [Findings] Self-supervised Meta-Prompt Learning with Meta-Gradient Regularization for Few-shot Generalization
  Kaihang Pan, Juncheng Li, Hongye Song, Jun Lin, Xiaozhong Liu, Siliang Tang
- [Findings] ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning
  Viet Dac Lai, Nghia Trung Ngo, Amir Pouran Ben Veyseh, Hieu Man, Franck Dernoncourt, Trung Bui, Thien Huu Nguyen
- [Findings] XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages
  Sebastian Ruder, Jonathan H. Clark, Alexander Gutkin, Mihir Kale, Min Ma, Massimo Nicosia, Shruti Rijhwani, Parker Riley et al.
- [Findings] Orca: A Few-shot Benchmark for Chinese Conversational Machine Reading Comprehension
  Nuo Chen, Hongguang Li, Junqing He, Yinan Bao, Xinshi Lin, Qi Yang, Jianfeng Liu, Ruyi Gan et al.
- [Findings] KG-GPT: A General Framework for Reasoning on Knowledge Graphs Using Large Language Models
  Jiho Kim, Yeonsu Kwon, Yohan Jo, Edward Choi
- [Findings] PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale
  Bryan Li, Chris Callison-Burch
- [Findings] Towards General Error Diagnosis via Behavioral Testing in Machine Translation
  Junjie Wu, Lemao Liu, Dit-Yan Yeung
- [Findings] Boot and Switch: Alternating Distillation for Zero-Shot Dense Retrieval
  Fan Jiang, Qiongkai Xu, Tom Drummond, Trevor Cohn
- [Findings] Estimating Large Language Model Capabilities without Labeled Test Data
  Harvey Yiyun Fu, Qinyuan Ye, Albert Xu, Xiang Ren, and Robin Jia
- [Findings] InstructExcel: A Benchmark for Natural Language Instruction in Excel
  Justin Payan, Swaroop Mishra, Mukul Singh, Carina Suzana Negreanu, Christian Poelitz, Chitta Baral, Subhro Roy, Rasika Chakravarthy et al.
- [Findings] Large Language Models' Generalization Ability Makes It a Good Source for Clinical Data Creation
  Rumeng Li, Xun Wang, Hong Yu
- [Findings] HeQ: a Large and Diverse Hebrew Reading Comprehension Benchmark
  Amir David Nissan Cohen, Hilla Merhav-Fine, Yoav Goldberg, Reut Tsarfaty
- [Findings] The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
  Dung Nguyen Manh, Nam Le Hai, Anh T. V. Dau, Anh Minh Nguyen, Khanh Nghiem, Jin Guo, Nghi D. Q. Bui