• The GenBench proceedings can be found online, the slides of the invited speakers can be found on the workshop programme page stay tuned for the videos!
  • GenBench workshop 1 is a wrap, but good news: we’ll be back next year at EMNLP!
  • The programme is online, and the papers for the poster sessions will soon be announced:
  • Deadline now passed The camera-ready deadline is coming up: October 26. You can use 1 extra page (max. 9) to integrate the reviewers’ comments.
  • Closed: We are accepting EMNLP Findings papers with a good fit, please reach out via email ( for instructions.
  • After technical difficulties with the ARR submission, we extended the submission deadline to October 11! Submissions can now be made via this Openreview page.
  • Would you like to submit to the Collaborative Benchmark Task, but you missed the August 1 sample deadline? Get in contact with us, you can still submit to the September 1 deadline!
  • The CoNLL conference now allows dual submissions with GenBench, and vice versa! Is your CoNLL submission a good fit with the GenBench mission but are you still waiting for your reviews? Also submit to GenBench! See our information about dual submissions.
  • The Collaborative Benchmarking Task is now accepting submissions, see!
  • Exciting news: thanks to our sponsor Amazon, we can now offer scholarships!
  • Our call for papers is online!

Best paper and data awards

We granted both a best paper and best data data award.

  • Best paper: Andor Diera, Abdelhalim Dahou, Lukas Galke, Fabian Karl, Florian Sihler, Ansgar Scherp. GenCodeSearchNet: A benchmark test suite for evaluating generalization in programming language understanding.
  • Best data: Jirui Qi, Raquel Fernández, Arianna Bisazza. Cross-lingual consistency of factual knowledge in multilingual language models


The First GenBench workshop was be held at EMNLP 2023 in Singapore on December 6.

Important dates

  • August 1, 2023 – Sample data submission deadline
  • September 1, 2023 – Paper submission deadline, submission link:
  • September 15, 2023 extended to October 11, 2023 – ARR commitment deadline, submission link:
  • October 6, 2023 extended to October 11, 2023 – Notification deadline for direct submissions
  • October 13, 2023 - Notification deadline for ARR submissions
  • October 26, 2023 – Camera-ready deadline (you can use up to 9 pages for your camera-ready paper!)
  • December 6, 2023 – Workshop

Note: all deadlines are 11:59PM UTC-12:00

Workshop description

The ability to generalise well is often mentioned as one of the primary desiderata for models of natural language processing (e.g. Marcus, 1998; Schmidhuber, 1990; Wong and Wang, 2007; Lake et al., 2017; Yogatama et al., 2019; Linzen, 2020; Elangovan et al., 2021; Marcus, 2018). Generalisation is crucial to ensure that models behave robustly, reliably and fairly when making predictions about data that is different from the data that they were trained on and is also important when NLP models are considered from a cognitive perspective, as models of human language. Yet, what good generalisation entails and how it should be evaluated is not well understood, nor are there any common standards to evaluate it (Hupkes et al., 2022). As a result, it is difficult to understand what the current state of the field is when it comes to generalisation. It is difficult to understand how results in this area relate to each other, what sorts of generalisation are being addressed and which are neglected, which forms of generalisation testing we should prioritise in which types of scenarios, and how we can adequately assess generalisation in the first place. Missing answers to all of those questions are standing in the way of better model development: what we cannot measure, we cannot improve. The GenBench workshop on (benchmarking) generalisation in NLP aims to serve as a cornerstone to catalyse research on generalisation in the NLP community. The workshop has two concrete goals:

  • Bring together different expert communities to discuss challenging questions relating to generalisation in NLP.
  • Establish a shared platform for state-of-the-art generalisation testing in NLP, with a leaderboard for a selection of tests that are created and selected not by one group, but by a larger community.

Call for papers

To reach out workshop goals, we welcome two different types of submissions: regular workshop submissions and collaborative benchmarking task submissions. The latter will consist of a data/task artefact and a companion paper motivating and evaluating the submission. In both cases, we accept archival papers and extended abstracts.

Closed: We accept all papers from the Findings of EMNLP related to our topics of interest below. Contact us via if you would like to present your EMNLP Findings paper at GenBench 2023!

Submission types

Submission type 1: generalisation and opinion papers

Towards our first goal, we invite paper submissions on a topics related to generalisation in NLP. Such submissions present work on the topic of generalisation (see examples listed below), but are not intended to be included on the GenBench evaluation platform. Regular workshop papers may be submitted as an archival paper, when they report on completed, original and unpublished research; or as a shorter extended abstract. More details on this category can be found below.

Topics of interest include, but are not limited to:

  • Opinion or position papers about generalisation and how it should be evaluated;
  • Analyses of how existing or new models generalise;
  • Empirical studies that propose new paradigms to evaluate generalisation;
  • Meta analyses that compare how results from different generalisation studies compare;
  • Meta analyses that study how different types of generalisation are related;
  • Papers that discuss how generalisation of LLMs can be evaluated without access to training data;
  • Papers that discuss why generalisation is (not) important in the era of LLMs.
  • Studies on the relationship between generalisation and fairness or robustness;

If you are unsure whether a specific topic is well-suited for submission, feel free to reach out to the organisers of the workshop at

Submission type 2: Collaborative Benchmarking Task (CBT) submissions

To achieve the second goal of our workshop, we organise a collaborative benchmarking task (CBT), in similar spirits to the BIG-Bench challenge, but focusing specifically on non-i.i.d. generalisation. We invite researchers to submit challenging and diverse generalisation tests to the GenBench CBT.

Collaborative benchmarking task submissions consist of a data/task artefact and a paper describing and motivating the submission and showcasing it on a select number of models. We accept submissions that introduce new datasets, resplits of existing datasets along particular dimensions, or in-context learning tasks, with the goal of measuring generalisation of NLP models. We especially encourage papers that attack one of the challenges presented in Hupkes et al. (2022):

  • Generalisation in LLMs, where we have no control over the training data
  • Generalisation in the context of fairness and inclusivity
  • Multilingual generalisation

Each submission should contain information about the data (URIs, format, preprocessing), model preparation (finetuning loss, ICL prompt templates), and evaluation metrics. These will be defined either in a configuration file or in code. More details about the collaborative benchmark submissions and example submissions can be found on as well as the cbt submission page.

Participants proposing previously unpublished datasets or splits may choose to submit an archival paper or an extended abstract. Generalisation evaluation datasets that have already been published elsewhere (or will be published at EMNLP 2023) can be submitted to the platform, as well, but only through an extended abstract, citing the original publication. We allow dual submissions with EMNLP, for more information, see below.

If you are in doubt whether a particular type of dataset is suitable for submission, please consult the information page on our website, or reach out to the organisers of the workshop at

All accepted generalisation test submissions will be included in the proceedings of the workshop, and we will feature a top-selection, which will be included also in the GenBench 1.0 leaderboard, on the GenBench platform. Following Big-Bench, after the workshop is finished, we aim to do a larger-scale testing with the top tests with a range of different models (Srivastava et al., 2022).

Archival vs extended abstract

Archival papers are up to 8 pages excluding references and report on completed, original and unpublished research. They follow the requirements of regular EMNLP 2023 submissions. Accepted papers will be published in the workshop proceedings and are expected to be presented at the workshop. The papers will undergo double-blind peer-review and should thus be anonymised. Extended abstracts can be up to 2 pages excluding references, and may report on work in progress or be cross submissions of work that has already appeared in another venue. Abstract titles will be posted on the workshop website, but will not be included in the proceedings.

Submission instructions

For both archival papers and extended abstracts, we refer to the EMNLP 2023 website for paper templates. Collaborative benchmarking tasks should be submitted on the cbt submission page, an accompanying paper should be submitted through OpenReview. Regular workshop papers are submitted through OpenReview.

Submission link: Submissions are now closed, except for ARR commitment submissions. Submit via by October 1st.

Dual submissions

We allow dual submissions with both EMNLP and CoNLL, and we encourage relevant papers that were dual-submitted and accepted at EMNLP to redirect to a non-archival extended abstract submission. We furthermore welcome submissions of extended abstracts that describe work already presented at an earlier venue, both in the collaborative benchmarking and in the regular submission tracks.


We do not have an anonymity deadline, preprints are allowed, both before the submission deadline as well as after.


Our intended workshop programme consists of different elements:

  • invited presentations
  • spotlight presentation of type 1 submissions
  • oral presentations of a selection of type 2 submissions
  • poster presentations of all submissions
  • a panel on generalisation, bringing together experts from different communities

In the panel, we will discuss topics such as how to best involve domain experts in the design of generalisation tests, the future of generalisation testing and when generalisation testing is important and when it is not. Furthermore, we will add topics drawn from the workshop submissions, as well as questions solicited through an online poll prior to the workshop.

The timetable can now be found here:

Invited speakers

Adina Williams

Adina Williams Speaker

Adina is a Research Scientist at Meta on the Fundamental AI Research (FAIR) team in NYC. Her research spans several topics in NLP and computational linguistics, with a focus on dataset creation and model evaluation for humanlikeness, fairness, generalization and robustness.

Tatsunori Hashimoto

Tatsunori Hashimoto Speaker

Tatsunori Hashimoto is an Assistant Professor in the Computer Science Department at Stanford University. He is a member of the statistical machine learning and natural language processing groups at Stanford, and his research uses tools from statistics to make machine learning systems more robust and trustworthy — especially in complex systems such as large language models. He is a Kavli fellow, a Sony and Amazon research award winner, and his work has been recognized with best paper awards at ICML and CHI. Before becoming an Assistant Professor, he was a postdoctoral researcher at Stanford with Percy Liang and John Duchi and received his Ph.D. from MIT under the supervision of Tommi Jaakkola and David Gifford.

Anna Rogers

Anna Rogers Speaker

Dr. Anna Rogers is an assistant professor at IT University of Copenhagen working on analysis, interpretability, and evaluation of NLP models, their societal impact, and NLP research methodology.


We would like to thank Amazon for sponsoring our workshop.

Amazon Sponsor


With the support of our workshop sponsor Amazon, we are offering 6 scholarships, each covering up to $500 of travel expenses and/or (virtual) registration fees. We strongly encourage students from developing countries and marginalized communities to apply. To submit your application, please send us an email ( with the following information (the deadline is September 1):

  • your CV
  • A few motivational sentences: why do you want to attend the workshop, and how would the funds help you with that?


Dieuwke Hupkes is a research scientist at FAIR. Her primary research interest is better understanding models for NLP and how that relates to (linguistic, philosophical) knowledge about language.

Verna Dankers is a PhD student at the Centre for Doctoral Training in NLP, University of Edinburgh. Her primary research interests lie at the intersection of compositional generalisation for natural language tasks and interpretability.

Khuyagbaatar Batsuren is an Associate Professor at the National University of Mongolia. His research interest focuses on computational morphology and multilingual NLP.

Koustuv Sinha is a Research Scientist at Meta AI Research (Fundamental AI Research team). His research focuses on investigating systematicity and generalisation in natural language understanding (NLU) models, especially the state-of-the-art large language models, and developing methods to alleviate generalisation issues in production.

Amirhossein Kazemnejad is a master’s student at McGill University and Mila, where he studies the generalisation capabilities of Transformers.

Christos Christodoulopoulos is a Senior Applied Scientist at Amazon Research Cambridge, working on knowledge extraction and verification.

Ryan Cotterell is an assistant professor of computer science at ETH Zurich where he is affiliated with the Institute for Machine Learning, the AI Center, and the Media Technology Center. He primarily researches topics in natural language publishing and machine learning.

Elia Bruni is a professor of Natural Language Processing at the University of Osnabrück. His research focuses on deep learning for natural language processing and he is particularly interested in assessing the ability of neural networks to process language compositionally.

Program Committee

Lisa Beinborn, Vrije Universiteit Amsterdam
Jonathan Brophy, University of Oregon
Lisa Bylinina, University of Groningen
Benoit Crabbé, Université de Paris
Ghazi Felhi, University Paris 13
Robert Frank, Yale University
Mario Giulianelli, University of Amsterdam
Yangfeng Ji, University of Virginia
Robin Jia, University of Southern California
Richard Johansson, Chalmers University
Jenny Kunz, Linköping University
Carolin Lawrence, NEC Laboratories Europe
Alessandro Lenci, University of Pisa
Sheng Liang, Ludwig-Maximilians-Universität München
Tomasz Limisiewicz, Charles University Prague
Matthias Lindemann, University of Edinburgh
R. Thomas McCoy, Princeton University
William Merrill, New York University
Anmol Nayak, Bosch
Joakim Nivre, Uppsala University
Maria Ryskina, Massachusetts Institute of Technology
Niladri S. Chatterji, Stanford University
Hendrik Schuff, Technische Universität Darmstadt
Rico Sennrich, University of Zurich
Mattia Setzu, University of Pisa
Sanchit Sinha, University of Virginia
Shane Steinert-Threlkeld, University of Washington
Aarne Talman, Basement AI
Oskar van der Wal, University of Amsterdam
Alex Warstadt, New York University
Hao Yang, Beijing University of Post and Telecommunication
Naoki Yoshinaga, the University of Tokyo
Dylan Z Slack, University of California, Irvine
Sheng Zhang, Amazon
Yichu Zhou, Yahoo

Anti-Harassment Policy

GenBench adheres to the ACL Anti-Harassment Policy.