Programme

Note that all time slots listed below are in Singapore Standard Time (GMT+8) and that all activities take place in Central Ballroom 3.

Morning programme

09:00-09:15 — Opening remarks

09:15-10:00 — Keynote 1, by Anna Rogers

Title: A sanity check on emergent properties

Abstract: One of the frequent points in the mainstream narrative about large language models is that they have “emergent properties” (sometimes even dangerous enough to be considered an existential risk to mankind). However, there is much disagreement about even the very definition of such properties. If they are understood as a kind of generalization beyond training data - as something that a model does without being explicitly trained for it - I argue that we have not in fact established the existence of any such properties, and at the moment we do not even have the methodology for doing so.

10:00-11:15 — Poster session 1

The following posters will be presented in this session:
  • GenBench Temporal Generalizability in Multimodal Misinformation Detection
    Nataliya Stepanova and Björn Ross
  • GenBench Robust Generalization Strategies for Morpheme Glossing in an Endangered Language Documentation Context
    Michael Ginn and Alexis Palmer
  • GenBench Syntax-Guided Transformers: Elevating Compositional Generalization and Grounding in Multimodal Environments
    Danial Kamali and Parisa Kordjamshidi
  • GenBench Inductive Bias Is in the Eye of the Beholder
    Michael Wilson and Robert Frank
  • GenBench CBT On using distribution-based compositionality assessment to evaluate compositional generalisation in machine translation
    Anssi Moisio, Mathias Creutz, and Mikko Kurimo
  • GenBench Non-archival The ICL consistency test
    Lucas Weber, Elia Bruni, and Dieuwke Hupkes
  • GenBench Non-archival Generalizability and Robustness of Large Language Models Detecting Alzheimer’s Disease from Speech
    Jekaterina Novikova
  • GenBench Cross-Lingual Consistency of Factual Knowledge in Multilingual Language Models
    Jirui Qi, Raquel Fernández, and Arianna Bisazza
  • GenBench Non-archival The Validity of Evaluation Results: Assessing Concurrence Across Compositionality Benchmarks
    Kaiser Sun, Adina Williams, and Dieuwke Hupkes
  • GenBench Walking a Tightrope -- Evaluating Large Language Models in High-Risk Domains
    Chia-Chien Hung, Wiem Ben Rim, Lindsay Frost, Lars Bruckner, and Carolin Lawrence
  • Other Subword Segmental Machine Translation: Unifying Segmentation and Target Sentence Generation
    Francois Meyer and Jan Buys
  • Findings The language of prompting: What linguistic properties make a prompt successful?
    Alina Leidinger, Robert Van Rooij, and Ekaterina Shutova
  • Findings IRFL: Image Recognition of Figurative Language
    Ron Yosef, Yonatan Bitton, and Dafna Shahaf
  • Findings Three Questions Concerning the Use of Large Language Models to Facilitate Mathematics Learning
    An-Zi Yen and Wei-Ling Hsu
  • Findings mReFinED: An Efficient End-to-End Multilingual Entity Linking System
    Peerat Limkonchotiwat, Weiwei Cheng, Christos Christodoulopoulos, Amir Saffari, and Jens Lehmann
  • Findings Noisy Self-Training with Synthetic Queries for Dense Retrieval
    Fan Jiang, Tom Drummond, and Trevor Cohn
  • Findings Viewing Knowledge Transfer in Multilingual Machine Translation Through a Representational Lens
    David Stap, Vlad Niculae, and Christof Monz
  • Findings Quantifying the Dialect Gap in Large Language Models and its Causes Across Languages
    Anjali Kantharuban, Ivan Vulić, and Anna Korhonen
  • Findings How Predictable Are Large Language Model Capabilities? A Case Study on BIG-bench
    Qinyuan Ye, Harvey Yiyun Fu, Xiang Ren, and Robin Jia
  • Findings Harnessing Dataset Cartography for Improved Compositional Generalization in Transformers
    Osman Batur İnce, Tanin Zeraati, Semih Yagcioglu, Yadollah Yaghoobzadeh, Erkut Erdem, and Aykut Erdem
  • Findings Are Structural Concepts Universal in Transformer Language Models? Towards Interpretable Cross-Lingual Generalization
    Ningyu Xu, Qi Zhang, Jingting Ye, Menghan Zhang, and Xuanjing Huang
  • Findings Test-Time Self-Adaptive Small Language Models for Question Answering
    Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C. Park
  • Findings The Less the Merrier? Investigating Language Representation in Multilingual Models
    Hellina Hailu Nigatu, Atnafu Lambebo Tonja, and Jugal Kalita
  • Findings Test-time Augmentation for Factual Probing
    Go Kamoda, Benjamin Heinzerling, Keisuke Sakaguchi, and Kentaro Inui

10:30-11:00 — Coffee break

11:15-12:00 — Keynote 2, by Adina Williams

Title: Evaluation after the LLM boom: frustrations, fallacies, and the future

12:00-12:30 — CBT spotlights

  • 12:00-12:08 — GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding
    Andor Diera, Abdelhalim Dahou, Lukas Galke, Fabian Karl, Florian Sihler, Ansgar Scherp

  • 12:08-12:15 — Latent Feature-based Data Splits to Improve Generalisation Evaluation: A Hate Speech Detection Case Study
    Maike Züfle, Verna Dankers, Ivan Titov

  • 12:15-12:23 — On using distribution-based compositionality assessment to evaluate compositional generalisation in machine translation
    Anssi Moisio, Mathias Creutz, Mikko Kurimo

  • 12:23-12:30 — Cross-Lingual Consistency of Factual Knowledge in Multilingual Language Models
    Jirui Qi, Raquel Fernández, Arianna Bisazza

12:30-14:00 — Lunch break

Afternoon programme

14:00-14:45 — Keynote 3, by Tatsunori Hashimoto

Title: Understanding generalization for instruction following and black-box language models

Abstract: Instruction following language models have shown a remarkable ability to perform a wide range of tasks with little to no additional training data. Do these abilities come from a revolution in pre-training and instruction-following, or are there other more mundane explanations for how these models work? In this talk, I will discuss our efforts to answer these questions by replicating instruction-following models that generalize across tasks, studying the consistency of these models across different task formats, and building tests for benchmark contamination in pretraining.

14:45-15:30 — Oral presentations

  • 14:45-15:00 — Evaluating Neural Language Models as Cognitive Models of Language Acquisition
    Hector Javier Vazquez Martinez, Annika Lea Heuser, Charles Yang, Jordan Kodner

  • 15:00-15:15 — Understanding Code Semantics: An Evaluation of Transformer Models in Summarization
    Debanjan Mondal, Abhilasha Lodha, Ankita Sahoo, Beena Kumari

  • 15:15-15:30 — Cross-Lingual Data Augmentation For Thai Question-Answering
    Parinthapat Pengpun, Can Udomcharoenchaikit, Weerayut Buaphet, Peerat Limkonchotiwat

15:30-16:00 — Coffee break

16:00-17:00 — Poster session 2 (hybrid)

The following posters will be presented in this session:
  • GenBench 90% F1 Score in Relation Triple Extraction: Is it Real?
    Pratik Saini, Samiran Pal, Tapas Nayak, Indrajit Bhattacharya
  • GenBench CBT mSCAN: A Dataset for Multilingual Compositional Generalisation Evaluation
    Amélie Reymond, Shane Steinert-Threlkeld
  • GenBench CBT GQG: Generalized Quantifier Generalization - A Dataset for Evaluating Quantifier Semantics Understanding in Language Models
    Leroy Zhifei Wang, Shane Steinert-Threlkeld
  • GenBench Non-archival Fighting Bias with Bias: Promoting Model Robustness by Amplifying Dataset Biases
    Yuval Reif, Roy Schwartz
  • GenBench CBT GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding
    Andor Diera, Abdelhalim Dahou, Lukas Galke, Fabian Karl, Ansgar Scherp
  • GenBench CBT Latent Feature-based Data Splits to Improve Generalisation Evaluation: A Hate Speech Detection Case Study
    Maike Züfle, Verna Dankers, Ivan Titov
  • GenBench CBT Blackbird Language Matrices Tasks for Generalization
    Paola Merlo, Chunyang Jiang, Giuseppe Samo, Vivi Nastase
  • GenBench In-Context Learning for Text Classification with Many Labels
    Aristides Milios, Siva Reddy, Dzmitry Bahdanau
  • GenBench CBT Shifted PAUQ: Distribution shift in text-to-SQL
    Oleg Somov, Elena Tutubalina
  • Findings USB: A Unified Summarization Benchmark Across Tasks and Domains
    Kundan Krishna, Prakhar Gupta, Sanjana Ramprasad, Byron C Wallace, Jeffrey P. Bigham, Zachary Chase Lipton
  • Findings Effects of Human Adversarial and Affable Samples on BERT Generalizability
    Aparna Elangovan, Jiayuan He, Yuan Li, Karin Verspoor
  • Findings Generalizing Few-Shot Named Entity Recognizers to Unseen Domains with Type-Related Features
    Zihan Wang, Ziqi Zhao, Zhumin Chen, Pengjie Ren, Maarten de Rijke, Zhaochun Ren
  • Findings Compositional Generalization for Data-to-Text Generation
    Xinnuo Xu, Ivan Titov, Mirella Lapata
  • Findings Self-supervised Meta-Prompt Learning with Meta-Gradient Regularization for Few-shot Generalization
    Kaihang Pan, Juncheng Li, Hongye Song, Jun Lin, Xiaozhong Liu, Siliang Tang
  • Findings ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning
    Viet Dac Lai, Nghia Trung Ngo, Amir Pouran Ben Veyseh, Hieu Man, Franck Dernoncourt, Trung Bui, Thien Huu Nguyen
  • Findings XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages
    Sebastian Ruder, Jonathan H. Clark, Alexander Gutkin, Mihir Kale, Min Ma, Massimo Nicosia, Shruti Rijhwani, Parker Riley et al.
  • Findings Orca: A Few-shot Benchmark for Chinese Conversational Machine Reading Comprehension
    Nuo Chen, Hongguang Li, Junqing He, Yinan Bao, Xinshi Lin, Qi Yang, Jianfeng Liu, Ruyi Gan et al.
  • Findings KG-GPT: A General Framework for Reasoning on Knowledge Graphs Using Large Language Models
    Jiho Kim, Yeonsu Kwon, Yohan Jo, Edward Choi
  • Findings PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale
    Bryan Li, Chris Callison-Burch
  • Findings Towards General Error Diagnosis via Behavioral Testing in Machine Translation
    Junjie Wu, Lemao Liu, Dit-Yan Yeung
  • Findings Boot and Switch: Alternating Distillation for Zero-Shot Dense Retrieval
    Fan Jiang, Qiongkai Xu, Tom Drummond, Trevor Cohn
  • Findings Estimating Large Language Model Capabilities without Labeled Test Data
    Harvey Yiyun Fu, Qinyuan Ye, Albert Xu, Xiang Ren, and Robin Jia
  • Findings InstructExcel: A Benchmark for Natural Language Instruction in Excel
    Justin Payan, Swaroop Mishra, Mukul Singh, Carina Suzana Negreanu, Christian Poelitz, Chitta Baral, Subhro Roy, Rasika Chakravarthy et al.
  • Findings Large Language Models' Generalization Ability Makes It a Good Source for Clinical Data Creation
    Rumeng Li, Xun Wang, Hong Yu
  • Findings HeQ: a Large and Diverse Hebrew Reading Comprehension Benchmark
    Amir David Nissan Cohen, Hilla Merhav-Fine, Yoav Goldberg, Reut Tsarfaty
  • Findings The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
    Dung Nguyen Manh, Nam Le Hai, Anh T. V. Dau, Anh Minh Nguyen, Khanh Nghiem, Jin Guo, Nghi D. Q. Bui

17:00-17:30 — Panel

17:30-17:45 — Closing remarks and best paper award