Workshop on Enormous Language Models

Perspectives and Benchmarks

Friday, May 7th at ICLR 2021


How to attend

The event will be livestreamed here. If you are registered for ICLR, you can find information about accessing the Zoom webinar here (requires login). Recordings of all talks and panels will be made available after the workshop. For a schedule of events, click here.

Overview


Language models that have been trained on unlabeled text data are a cornerstone of modern natural language processing (NLP) research, and many recent state-of-the-art results in NLP were achieved by leveraging these self-supervised models. The success of this recipe is largely thanks to scalability: Better results can often be obtained by training larger models on larger amounts of unlabeled text data. This synergy is particularly fruitful thanks to the wide availability of unlabeled text data on the internet and the continual improvement of hardware accelerators for training machine learning models. Notable examples of models that make use of this scalability include RoBERTa, which attained dramatically better performance simply by training for longer; T5-11B, which achieved near-human performance on the challenging SuperGLUE benchmark by scaling to 11 billion parameters; GShard, which produced a 600-billion parameter machine translation model that supports over 100 languages with state-of-the-art accuracy; and GPT-3, which showed that scaling language models beyond 100 billion parameters could achieve strong performance on many tasks without any further task-specific training. Indeed, the "scaling laws" of these models demonstrate approximately log-linear improvements in performance over more than 6 orders of magnitude in parameter count. Naïve extrapolation of these trends suggests that a model with an additional 3-5 orders of magnitude of parameters would saturate performance on most current benchmarks.
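The extrapolation argument above can be made concrete with a small sketch. It assumes, purely for illustration, that a benchmark score grows linearly in the logarithm of the parameter count. The data points, the target score of 100, and the helper names `fit_log_linear` and `orders_of_magnitude_to_reach` are hypothetical and are not taken from any of the models named above.

```python
import math


def fit_log_linear(param_counts, scores):
    """Least-squares fit of score = slope * log10(params) + intercept."""
    xs = [math.log10(n) for n in param_counts]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(scores) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def orders_of_magnitude_to_reach(target, largest_params, slope, intercept):
    """Extra orders of magnitude of parameters needed for the fitted
    line to reach `target`, measured from the largest existing model."""
    log_params_needed = (target - intercept) / slope
    return log_params_needed - math.log10(largest_params)


# Made-up scores for illustration only (not measurements of real models).
param_counts = [1e4, 1e6, 1e8, 1e10]
scores = [0.0, 20.0, 40.0, 60.0]

slope, intercept = fit_log_linear(param_counts, scores)
extra = orders_of_magnitude_to_reach(100.0, param_counts[-1], slope, intercept)
print(f"score gain per 10x parameters: {slope:.1f}")
print(f"additional orders of magnitude to saturate: {extra:.1f}")
```

A straight-line fit like this is exactly the naive extrapolation the paragraph cautions about: it says nothing about whether the trend continues to hold beyond the measured range.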

These results place our field at a crossroads. Will scaling lead to models that outperform humans on all text-based tasks, or are there limits to the scalability of these models? Should we focus on simply scaling these models, or should we design more sophisticated architectures and training schemes? Do our current benchmarks effectively test capabilities that humans can master but large language models lack? How can we address the legal and ethical issues that arise from using unstructured web crawls for training language models? What can we learn from the fields of cognition, linguistics, and philosophy as we attempt to measure the "intelligence" of machines? The goal of this workshop is to find answers to these questions by inviting a diverse group of researchers to critically examine the state of giant language models. We also hope to provide concrete evidence of the capabilities and limitations of current enormous language models through a participant-driven benchmark.

Confirmed speakers and panelists

Emily M. Bender


Noam Shazeer

Yejin Choi

Thomas Wolf

Natalie Schluter

Alison Gopnik

Emily Dinan

Nicholas Carlini

Jesse Dodge

Thomas Margoni

Mike Lewis

Call for participation

This workshop will have a non-standard submission format: Rather than submitting research papers, participants will be invited to contribute diverse tasks that they believe measure uniquely human or particularly challenging capabilities for large language models. Teams at Google and OpenAI have committed to evaluate this task set on their best-performing model architectures, across models spanning from tens of thousands through hundreds of billions or more of parameters. Researchers will also be invited to contribute and evaluate their own models on these tasks. We will analyze these experiments, and report the results at the workshop, with a particular focus on how model performance on different task types scales with model size. By inviting contributions of tasks or models, we provide a means for researchers to participate whether or not they have the (cost-prohibitive) computational resources to train giant language models. The end result will be the Beyond the Imitation Game Benchmark (BIG-bench): A novel participant-driven test of the limits of giant language models.

Accepted task authors will be invited to be co-authors on the paper announcing the benchmark. BIG-bench task submissions will be accepted until June 1, 2021, after the workshop. Find out more about BIG-bench and participate here.

Schedule


Like all ICLR 2021 workshops, WELM will be held remotely on Friday, May 7th 2021. All times listed below are in the UTC-06:00 timezone ("US Mountain Time"). You can view a Google Calendar of all of the events here.

Time Event
8:45-9:00am Opening remarks
9:00-9:30am Invited talk: "Brief copyright reflections on enormous language model training" by Thomas Margoni
Abstract: Is the use of “data” available on the Internet for the purpose of training language models lawful, or should prior authorisation be obtained? This apparently simple question reveals the complexity of a field intersecting law, technology and the usually borderless nature of the Internet. In this talk we will focus on copyright’s international framework, including a few comparative references (mainly EU and US) to offer contextual examples. A selection of some of the most “popular” aspects will be briefly addressed, in particular taxonomies (“data” for NLP is not “data” for copyright law), rights (how many and what types of copies are made), and licenses (should only Creative Commons works be used?). Finally, some concluding remarks on the role of legal rules in favouring open, fair and accountable technological developments will be formulated.

Bio: Thomas Margoni is Research Professor of Intellectual Property Law and a member of the Board of Directors of the Centre for IT & IP Law (CiTiP), Faculty of Law, KU Leuven (BE). His work concentrates on international, comparative and EU copyright law applied to new technologies and he is an expert on legal issues pertaining to training language models.
9:30-10:00am Invited talk: "Is brevity the soul of wit? What information to report about our data" by Jesse Dodge
Abstract: Natural language processing and machine learning have grown tremendously in recent years, and researchers hold myriad opinions on what to report in their papers. In this talk I will present a high-level overview of the NLP Reproducibility Checklist, which provides general recommendations for what information to report in NLP papers. Then, I will dive into an example of documenting C4, a massive unlabeled text corpus built from web-crawled data. Finally, I will introduce a framework for modeling bias in data, show that this framework recovers annotation artifacts in existing datasets, and describe a technique which can help mitigate the impact of such artifacts.

Bio: Jesse Dodge is a postdoctoral researcher at AllenAI who recently completed a PhD in Computer Science at Carnegie Mellon University. He has done extensive work on the reproducibility and reporting of research on giant language models, as well as the implications of their energy and financial cost.
10:00-10:30am Invited talk: "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜" by Emily M. Bender and Angelina McMillan-Major
Abstract: (Joint work with Timnit Gebru and Margaret Mitchell) The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English. BERT, its variants, GPT-2/3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size. Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English. In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models.

Bio: Emily M. Bender is a Professor of Linguistics at the University of Washington. Her research is focused on multilingual grammar engineering, the study of variation, both within and across languages, the relationship between linguistics and computational linguistics, and practical methods for promoting engagement with ethical issues in NLP. She coined the Bender Rule, co-created Data Statements, and is a co-author of the recent Stochastic Parrots 🦜 paper.
Angelina McMillan-Major is a PhD student in Computational Linguistics at the University of Washington. She is interested in methodologies for low-resource language documentation and revitalization, including machine learning methodologies, and thinking critically about the interaction between technology and language. She is a co-author of the recent Stochastic Parrots 🦜 paper.
10:30-10:45am Break to discuss talks and questions for panel #1
10:45-11:15am Invited talk: "BigScience: building a Large-Hadron-Collider in AI and NLP" by Thomas Wolf
Abstract: The acceleration in Artificial Intelligence (AI) and Natural Language Processing (NLP) will have a fundamental impact on society, as these technologies are at the core of the tools we use on a daily basis. In NLP, a considerable part of this effort currently stems from training increasingly large language models on increasingly large quantities of text. Unfortunately, the resources necessary to create the best-performing models are found mainly in industry rather than academia. This imbalance in access to a transformative technology poses problems from research-advancement, environmental, ethical, and societal perspectives. The BigScience project aims to demonstrate another way of creating, studying, and sharing large language models and large research artifacts in general within the AI/NLP research communities. BigScience takes inspiration from scientific creation schemes existing in other scientific fields, such as CERN and the LHC in particle physics, in which open scientific collaborations facilitate the creation of large-scale artifacts useful for the entire research community. Gathering a much larger research community around the creation of these artifacts makes it possible to consider in advance the many research questions surrounding large language models (capabilities, limitations, potential improvements, bias, ethics, environmental impact, general AI/cognitive research landscape) that will be interesting to answer with the created artifacts and to reflect and prepare the tools needed to answer as many of these questions as possible. The BigScience project is seen as a proposal for an alternative way to conduct large-scale science projects in a more international and inclusive way. Beyond the research artifacts created and shared, the project’s success will ultimately be measured by its long-term impact on the field: by proposing another way for large-scale collaborations inspired by the successes in fields like particle physics.

Bio: Thomas Wolf is co-founder and Chief Science Officer of HuggingFace. His team is on a mission to catalyze and democratize NLP research. Prior to HuggingFace, Thomas gained a Ph.D. in physics, and later a law degree. He worked as a physics researcher and a European Patent Attorney.
11:15-11:45am Invited talk: "Adversarial Benchmarking for Toxic Generation in Large Language Models" by Emily Dinan
Abstract: Large language models trained on massive unlabeled corpora of human interactions will learn patterns and mimic behaviors therein, which often include offensive or otherwise toxic behavior. In this talk, I will discuss the problem of toxic generation for large language models. We will explore why this problem is challenging from both a technical and ethical perspective. I will highlight adversarial benchmarking — with humans and models in the loop — as one possible path for making progress on these issues, among many others. Lastly, I will discuss open problems and next steps for this line of work.

Bio: Emily Dinan is a research engineer at Facebook AI. She works mainly on dialogue systems and adversarial benchmarks to better measure their capabilities. Past work along these lines includes the Build it Break it Fix it and Adversarial NLI benchmarks, research into bias and other harmful behaviors in dialogue models, and open-source efforts to build large-scale open-domain chatbots.
11:45-12:00pm Break to discuss talks and questions for panel #1
12:00-12:45pm Panel #1: “Bias, safety, copyright, and efficiency” with Thomas Wolf, Thomas Margoni, Emily Dinan, Natalie Schluter, and Jesse Dodge
12:45-1:07pm BIG-bench introduction and initial results by Jascha Sohl-Dickstein
1:07-1:27pm BIG-bench spotlight talks (two minutes each):
  1. WinoWhy by Hongming Zhang, Xinran Zhao
  2. Symbol Interpretation Task (SIT) by Antonio Norelli, et al.
  3. Logic Grid Puzzles by Jeremy Kim, et al.
  4. Word problems on sets and graphs by Benjamin Inden
  5. Operators by Jos Rozen
  6. GEM by Sebastian Gehrmann, et al.
  7. Logical Deduction by James Simon, Chandan Singh
  8. Cryptonite by Avia Efrat, et al.
  9. Formal Fallacies and Syllogisms by Gregor Betz, et al.
  10. Goal Step Inference by Li Zhang, et al.
1:27-2:00pm BIG-bench contributed talks (ten minutes each):
  1. Strategy QA by Mor Geva, et al.
  2. Gender Sensitivity Test English and Gender Sensitivity Test Chinese by Xudong Shen
  3. Joint talk: Linguistics Puzzles by Nathan A. Chi & Conlang Translation Problems by Rowan Jacobs, et al.
2:00-2:30pm Invited talk: "Noam's Neural Network Notation for Distributed Deep Learning" by Noam Shazeer
Abstract: Training enormous language models requires innovative distributed computation algorithms. I will cover the common algorithms, and present a partially novel notation for describing them.

Bio: Noam Shazeer is a principal software engineer at Google Brain. He has worked on large neural language models for many years, starting with work on training giant recurrent neural network LMs, developing the Mixture-of-Experts layer to train 100+ billion parameter models, designing the Transformer architecture, building the Mesh-TensorFlow and GShard libraries, and releasing the pre-trained T5 model.
2:30-3:00pm Invited talk: "Beyond Brute Force Scaling" by Mike Lewis
Abstract: Remarkable results have been achieved by using ever more compute to train even larger transformer language models. Is further progress possible without increasing the computational cost? I will focus on two promising alternative paradigms for more efficiently scaling up the capacity of language models. Firstly, I will discuss non-parametric language models, which use explicit memorization over large amounts of text. In particular, kNN-LM uses distances between representations from a pre-trained language model in a nearest neighbour classifier, which can dramatically improve perplexity with no additional training. Secondly, I will describe recent work on sparse models, where only a small subset of parameters is used on any given training example. I will introduce BASE layers, which provide the first “drop in” expert layer that can be used without modifying the model training objective. I will also speculate about why our models need to be so big, and future directions for moving beyond their limitations.

Bio: Mike Lewis is a research scientist at Facebook AI. He has worked on a diverse set of large language models, including the RoBERTa model that showed the importance of large-scale pre-training, the BART sequence-to-sequence model and its multilingual counterpart mBART, and the MARGE, RAG, and k-NN LM architectures that make use of a nonparametric memory.
3:00-3:30pm Invited talk: "Privacy & Enormous Language Models" by Nicholas Carlini
Abstract: This talk studies the privacy implications of training enormous language models on private datasets. Given access to GPT-2, we show it is possible to extract individual examples that were used to train the model. For example, we recover the full name, address, phone number, and email address of an individual person who happened to have their information in the training dataset. Most worryingly, we find that larger (billion parameter) models memorize significantly more information than smaller (hundred million parameter) models. As models continue to scale to larger sizes and larger datasets, it will be necessary to carefully understand their propensity for memorization. Since preventing these attacks is likely to require significant advancements in private training techniques, we argue that empirically measuring the privacy of language models (or lack thereof) is an important component of the release process.

Bio: Nicholas Carlini is a research scientist at Google Brain. He studies the security and privacy of machine learning, for which he has received best paper awards at ICML and IEEE S&P. He obtained his PhD from the University of California, Berkeley in 2018.
3:30-3:45pm Break to discuss talks and questions for panel #2
3:45-4:15pm Invited talk: "Causal, counterfactual and relational inference: Can a language model be as smart as a toddler?" by Alison Gopnik
Abstract: I will outline recent theoretical and empirical work which shows that 2-4 year old children can make causal and counterfactual inferences, extrapolate causal functions and perform analogical reasoning in new settings with novel objects, causal systems and variables. This allows them to draw dramatically new conclusions from small amounts of evidence. Models of human thought should be able to make similar inferences. An interesting possibility is that these capacities are related to the development of language.

Bio: Alison Gopnik is a Professor of Psychology, Affiliate Professor of Philosophy and member of the Berkeley Artificial Intelligence Research Group. Her research explores how young children come to know about the world around them. In particular, she researches how children build causal structure from patterns of data across physical, biological, and psychological domains.
4:15-4:45pm Invited talk: "David V.S. Goliath in the Era of Gigantic Neural Networks" by Yejin Choi
Abstract: In this talk, I'll discuss the art of battling giants, where we might find the underdogs, and what could possibly go wrong with GPT-3.

Bio: Yejin Choi is an associate professor of Computer Science & Engineering at the University of Washington with the Brett Helsel Career Development Professorship, adjunct of the Linguistics department, and affiliate of the Center for Statistics and Social Sciences. Together with her students, she developed the GROVER model for generating and detecting fake news, the UNICORN multi-task model, and many difficult benchmarks for giant language models such as CommonGen and TuringAdvice.
4:45-5:00pm Break to discuss talks and questions for panel #2
5:00-5:45pm Panel #2: “Extrapolating the capabilities of language models” with Alison Gopnik, Yejin Choi, Mike Lewis, and Emily M. Bender
5:45-6:00pm Closing remarks

Organizers


Colin Raffel

Adam Roberts

Amanda Askell

Daphne Ippolito

Ethan Dyer

Guy Gur-Ari

Jared Kaplan

Jascha Sohl-Dickstein

Katherine Lee

Melanie Subbiah

Vedant Misra

Tom Brown

Ambrose Slone

Liam Fedus

Daniel Freeman

Aitor Lewkowycz

Benchmark developers

Kristen Chiafullo, Ethan Dyer, Liam Fedus, Noah Fiedel, Daniel Freeman, Guy Gur-Ari, Jaehoon Lee, Aitor Lewkowycz, Gaurav Mishra, Vedant Misra, Isaac Noble, Timothy Nguyen, Danielle Perszyk, Ambrose Slone, Jascha Sohl-Dickstein

Benchmark program committee

Kyle Aitken, Igor Babuschkin, Adam Brown, David Dohan, Ethan Dyer, Stanislav Fort, Daniel Freeman, Dar Gilboa, Anna Golubeva, Guy Gur-Ari, Jesse Michael Han, Boris Hanin, Daniel Khashabi, Aitor Lewkowycz, Harsh Mehta, Gaurav Mishra, Timothy Nguyen, Isaac Noble, Alethea Power, Ambrose Slone, Jascha Sohl-Dickstein, James Sully, Neha Wadia

Advisory committee

Samuel R. Bowman

Melanie Mitchell

Percy Liang

Yacine Jernite
