• We are experiencing technical difficulties with the ARR submission, and will therefore extend the deadline. Stay tuned for more information!
  • Our ARR commitment deadline has been extended to October 1! Submissions can now be made via this OpenReview page.
  • Would you like to submit to the Collaborative Benchmarking Task, but missed the August 1 sample deadline? Get in contact with us: you can still submit by the September 1 deadline!
  • The CoNLL conference now allows dual submissions with GenBench, and vice versa! Is your CoNLL submission a good fit for the GenBench mission, but are you still waiting for your reviews? Then submit it to GenBench as well! See our information about dual submissions.
  • The Collaborative Benchmarking Task is now accepting submissions!
  • Exciting news: thanks to our sponsor Amazon, we can now offer scholarships!
  • Our call for papers is online!


The First GenBench workshop will be held at EMNLP 2023 in Singapore on December 6.

Important dates

Note: all deadlines are 11:59PM UTC-12:00
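
UTC-12:00 is the "Anywhere on Earth" offset, so each deadline falls on the following calendar day in UTC. The conversion can be sketched as below; the helper name `aoe_deadline_to_utc` is illustrative, and the year 2023 is an assumption taken from the workshop date.

```python
from datetime import datetime, timedelta, timezone

# "Anywhere on Earth" (AoE) is the fixed offset UTC-12:00.
AOE = timezone(timedelta(hours=-12), name="AoE")


def aoe_deadline_to_utc(year: int, month: int, day: int) -> datetime:
    """Return the 11:59 PM UTC-12:00 deadline on the given date, expressed in UTC."""
    deadline_aoe = datetime(year, month, day, 23, 59, tzinfo=AOE)
    return deadline_aoe.astimezone(timezone.utc)


# Example: the ARR commitment deadline of October 1 (year assumed to be 2023).
deadline_utc = aoe_deadline_to_utc(2023, 10, 1)
print(deadline_utc.isoformat())  # 2023-10-02T11:59:00+00:00
```

To see the deadline in your own time zone, call `deadline_utc.astimezone()` without arguments.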

Workshop description

The ability to generalise well is often mentioned as one of the primary desiderata for models of natural language processing (e.g. Marcus, 1998; Schmidhuber, 1990; Wong and Wang, 2007; Lake et al., 2017; Yogatama et al., 2019; Linzen, 2020; Elangovan et al., 2021; Marcus, 2018). Generalisation is crucial to ensure that models behave robustly, reliably and fairly when making predictions about data that is different from the data they were trained on, and it is also important when NLP models are considered from a cognitive perspective, as models of human language. Yet, what good generalisation entails and how it should be evaluated is not well understood, nor are there any common standards to evaluate it (Hupkes et al., 2022). As a result, it is difficult to understand what the current state of the field is when it comes to generalisation: how results in this area relate to each other, what sorts of generalisation are being addressed and which are neglected, which forms of generalisation testing we should prioritise in which types of scenarios, and how we can adequately assess generalisation in the first place. The missing answers to all of those questions stand in the way of better model development: what we cannot measure, we cannot improve. The GenBench workshop on (benchmarking) generalisation in NLP aims to serve as a cornerstone to catalyse research on generalisation in the NLP community. The workshop has two concrete goals:

  • Bring together different expert communities to discuss challenging questions relating to generalisation in NLP.
  • Establish a shared platform for state-of-the-art generalisation testing in NLP, with a leaderboard for a selection of tests that are created and selected not by one group, but by a larger community.

Call for papers

To reach our workshop goals, we welcome two different types of submissions: regular workshop submissions and collaborative benchmarking task submissions. The latter will consist of a data/task artefact and a companion paper motivating and evaluating the submission. In both cases, we accept archival papers and extended abstracts.

Submission types

Submission type 1: generalisation and opinion papers

Towards our first goal, we invite paper submissions on topics related to generalisation in NLP. Such submissions present work on the topic of generalisation (see examples listed below), but are not intended to be included on the GenBench evaluation platform. Regular workshop papers may be submitted as an archival paper, when they report on completed, original and unpublished research, or as a shorter extended abstract. More details on this category can be found below.

Topics of interest include, but are not limited to:

  • Opinion or position papers about generalisation and how it should be evaluated;
  • Analyses of how existing or new models generalise;
  • Empirical studies that propose new paradigms to evaluate generalisation;
  • Meta-analyses that compare results from different generalisation studies;
  • Meta-analyses that study how different types of generalisation are related;
  • Papers that discuss how generalisation of LLMs can be evaluated without access to training data;
  • Papers that discuss why generalisation is (not) important in the era of LLMs;
  • Studies on the relationship between generalisation and fairness or robustness.

If you are unsure whether a specific topic is well-suited for submission, feel free to reach out to the organisers of the workshop.

Submission type 2: Collaborative Benchmarking Task (CBT) submissions

To achieve the second goal of our workshop, we organise a collaborative benchmarking task (CBT), in a similar spirit to the BIG-Bench challenge, but focusing specifically on non-i.i.d. generalisation. We invite researchers to submit challenging and diverse generalisation tests to the GenBench CBT.

Collaborative benchmarking task submissions consist of a data/task artefact and a paper describing and motivating the submission and showcasing it on a select number of models. We accept submissions that introduce new datasets, resplits of existing datasets along particular dimensions, or in-context learning tasks, with the goal of measuring generalisation of NLP models. We especially encourage papers that attack one of the challenges presented in Hupkes et al. (2022):

  • Generalisation in LLMs, where we have no control over the training data
  • Generalisation in the context of fairness and inclusivity
  • Multilingual generalisation
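
As an illustration of what a resplit of an existing dataset along a particular dimension could look like, the sketch below holds out all examples above a length threshold, so that the test set is not drawn from the same distribution as the training set. The function name, the choice of length as the dimension, and the toy data are assumptions made for this example; they are not part of the GenBench submission format.

```python
from typing import Callable, List, Sequence, Tuple


def split_by_dimension(
    examples: Sequence[str],
    key: Callable[[str], float],
    threshold: float,
) -> Tuple[List[str], List[str]]:
    """Split examples into train (key <= threshold) and test (key > threshold).

    Unlike a random split, the two sides differ systematically along `key`.
    """
    train = [ex for ex in examples if key(ex) <= threshold]
    test = [ex for ex in examples if key(ex) > threshold]
    return train, test


# Toy data: split on the number of whitespace-separated tokens.
toy = ["a b", "a b c", "a b c d e", "a", "a b c d e f g"]
train, test = split_by_dimension(toy, key=lambda s: len(s.split()), threshold=3)
print(train)  # ['a b', 'a b c', 'a']
print(test)   # ['a b c d e', 'a b c d e f g']
```

The same pattern applies to other dimensions, such as domain, language, or syntactic depth, by swapping in a different `key` function.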

Each submission should contain information about the data (URIs, format, preprocessing), model preparation (finetuning loss, ICL prompt templates), and evaluation metrics. These will be defined either in a configuration file or in code. More details about the collaborative benchmark submissions, as well as example submissions, can be found on the CBT submission page.
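
To make the three required parts concrete, here is a minimal sketch of how such a description might be expressed and checked in code. All field names, the example URI, and the `missing_fields` helper are hypothetical and chosen only for illustration; the actual configuration format is the one documented on the CBT submission page.

```python
def missing_fields(config: dict) -> list:
    """Return the names of required (hypothetical) fields absent from config."""
    problems = []
    # Data: where it lives, what format it has, how it is preprocessed.
    for field in ("uri", "format", "preprocessing"):
        if field not in config.get("data", {}):
            problems.append(f"data.{field}")
    # Model preparation: a finetuning loss and/or an ICL prompt template.
    prep = config.get("model_preparation", {})
    if not (prep.keys() & {"finetuning_loss", "icl_prompt_template"}):
        problems.append("model_preparation")
    # Evaluation: at least one metric.
    if not config.get("evaluation", {}).get("metrics"):
        problems.append("evaluation.metrics")
    return problems


# Hypothetical description of an in-context learning task.
example_config = {
    "data": {
        "uri": "https://example.org/my_task/test.jsonl",  # placeholder URI
        "format": "jsonl",
        "preprocessing": "none",
    },
    "model_preparation": {"icl_prompt_template": "Input: {input}\nOutput:"},
    "evaluation": {"metrics": ["exact_match"]},
}

print(missing_fields(example_config))  # []
```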

Participants proposing previously unpublished datasets or splits may choose to submit an archival paper or an extended abstract. Generalisation evaluation datasets that have already been published elsewhere (or will be published at EMNLP 2023) can also be submitted to the platform, but only through an extended abstract citing the original publication. We allow dual submissions with EMNLP; for more information, see below.

If you are in doubt whether a particular type of dataset is suitable for submission, please consult the information page on our website, or reach out to the organisers of the workshop.

All accepted generalisation test submissions will be included in the proceedings of the workshop, and we will feature a top selection, which will also be included in the GenBench 1.0 leaderboard on the GenBench platform. Following BIG-Bench (Srivastava et al., 2022), we aim to conduct larger-scale testing of the top tests with a range of different models after the workshop is finished.

Archival vs extended abstract

Archival papers are up to 8 pages excluding references and report on completed, original and unpublished research. They follow the requirements of regular EMNLP 2023 submissions. Accepted papers will be published in the workshop proceedings and are expected to be presented at the workshop. The papers will undergo double-blind peer-review and should thus be anonymised. Extended abstracts can be up to 2 pages excluding references, and may report on work in progress or be cross submissions of work that has already appeared in another venue. Abstract titles will be posted on the workshop website, but will not be included in the proceedings.

Submission instructions

For both archival papers and extended abstracts, we refer to the EMNLP 2023 website for paper templates. Collaborative benchmarking tasks should be submitted on the CBT submission page, and the accompanying paper should be submitted through OpenReview. Regular workshop papers are submitted through OpenReview.

Submission link: Submissions are now closed, except for ARR commitment submissions, which can be submitted via OpenReview by October 1.

Dual submissions

We allow dual submissions with both EMNLP and CoNLL, and we encourage relevant papers that were dual-submitted and accepted at EMNLP to redirect to a non-archival extended abstract submission. We furthermore welcome submissions of extended abstracts that describe work already presented at an earlier venue, both in the collaborative benchmarking and in the regular submission tracks.


We do not have an anonymity deadline. Preprints are allowed, both before and after the submission deadline.


Our intended workshop programme consists of different elements:

  • invited presentations
  • spotlight presentations of type 1 submissions
  • oral presentations of a selection of type 2 submissions
  • poster presentations of all submissions
  • a panel on generalisation, bringing together experts from different communities

In the panel, we will discuss topics such as how best to involve domain experts in the design of generalisation tests, the future of generalisation testing, and when generalisation testing is important and when it is not. Furthermore, we will add topics drawn from the workshop submissions, as well as questions solicited through an online poll prior to the workshop.

Invited speakers

To be announced soon!


We would like to thank Amazon for sponsoring our workshop.


With the support of our workshop sponsor Amazon, we are offering 6 scholarships, each covering up to $500 of travel expenses and/or (virtual) registration fees. We strongly encourage students from developing countries and marginalized communities to apply. To submit your application, please send us an email with the following information (the deadline is September 1):

  • Your CV
  • A few motivational sentences: why do you want to attend the workshop, and how would the funds help you with that?


Dieuwke Hupkes is a research scientist at FAIR. Her primary research interest is better understanding models for NLP and how that relates to (linguistic, philosophical) knowledge about language.

Verna Dankers is a PhD student at the Centre for Doctoral Training in NLP, University of Edinburgh. Her primary research interests lie at the intersection of compositional generalisation for natural language tasks and interpretability.

Khuyagbaatar Batsuren is an Associate Professor at the National University of Mongolia. His research interest focuses on computational morphology and multilingual NLP.

Koustuv Sinha is a Research Scientist at Meta AI Research (Fundamental AI Research team). His research focuses on investigating systematicity and generalisation in natural language understanding (NLU) models, especially the state-of-the-art large language models, and developing methods to alleviate generalisation issues in production.

Amirhossein Kazemnejad is a master’s student at McGill University and Mila, where he studies the generalisation capabilities of Transformers.

Christos Christodoulopoulos is a Senior Applied Scientist at Amazon Research Cambridge, working on knowledge extraction and verification.

Ryan Cotterell is an assistant professor of computer science at ETH Zurich, where he is affiliated with the Institute for Machine Learning, the AI Center, and the Media Technology Center. He primarily researches topics in natural language processing and machine learning.

Elia Bruni is a professor of Natural Language Processing at the University of Osnabrück. His research focuses on deep learning for natural language processing and he is particularly interested in assessing the ability of neural networks to process language compositionally.

Anti-Harassment Policy

GenBench adheres to the ACL Anti-Harassment Policy.