Scenarios
Scenario | Task | What | Who | When | Language | Description |
---|---|---|---|---|---|---|
boolq | question answering | passages from Wikipedia, questions from search queries | web users | 2010s | English | The BoolQ benchmark for binary (yes/no) question answering (Clark et al., 2019). |
narrative_qa | question answering | passages are books and movie scripts; question source is unknown | ? | ? | English | The NarrativeQA benchmark for reading comprehension over narratives (Kočiský et al., 2017). |
natural_qa_closedbook (NaturalQuestions, closed-book) | question answering | passages from Wikipedia, questions from search queries | web users | 2010s | English | The NaturalQuestions (Kwiatkowski et al., 2019) benchmark for question answering based on naturally-occurring queries through Google Search. The input does not include the Wikipedia page with the answer. |
natural_qa_openbook_longans (NaturalQuestions, open-book) | question answering | passages from Wikipedia, questions from search queries | web users | 2010s | English | The NaturalQuestions (Kwiatkowski et al., 2019) benchmark for question answering based on naturally-occurring queries through Google Search. The input includes the Wikipedia page with the answer. |
quac (Question Answering in Context) | question answering | ? | ? | ? | English | The QuAC benchmark for question answering in the context of dialogues (Choi et al., 2018). |
hellaswag | question answering | commonsense reasoning | ? | ? | English | The HellaSwag benchmark for commonsense reasoning in question answering (Zellers et al., 2019). |
openbookqa | question answering | ? | ? | ? | English | The OpenbookQA benchmark for commonsense-intensive open book question answering (Mihaylov et al., 2018). |
truthful_qa | question answering | ? | ? | ? | English | The TruthfulQA benchmark for measuring model truthfulness and commonsense knowledge in question answering (Lin et al., 2022). |
mmlu (Massive Multitask Language Understanding) | question answering | ? | ? | ? | English | The Massive Multitask Language Understanding (MMLU) benchmark for knowledge-intensive question answering across 57 domains (Hendrycks et al., 2021). |
msmarco_regular | information retrieval | ? | ? | ? | English | The MS MARCO benchmark's regular track for passage retrieval in information retrieval (https://microsoft.github.io/msmarco/). |
msmarco_trec | information retrieval | ? | ? | ? | English | The MS MARCO benchmark's deep learning TREC track for passage retrieval in information retrieval (https://trec.nist.gov). |
summarization_cnndm | summarization | ? | ? | ? | English | The CNN/DailyMail benchmark for text summarization (Hermann et al., 2015; Nallapati et al., 2016). |
summarization_xsum | summarization | ? | ? | ? | English | The XSUM benchmark for text summarization of BBC news articles (Narayan et al., 2018). |
imdb | sentiment analysis | movie reviews | ? | ? | English | The IMDB benchmark for sentiment analysis of movie reviews (Maas et al., 2011). |
raft (Real-world Annotated Few-shot Tasks) | text classification | ? | ? | ? | English | The Real-world Annotated Few-shot Tasks (RAFT) meta-benchmark of 11 real-world text classification tasks (Alex et al., 2021). |
civil_comments | toxicity classification | ? | ? | ? | English | The CivilComments benchmark for toxicity detection (Borkan et al., 2019). |
ice (International Corpus of English) | language modeling | ? | ? | ? | English varieties from different nations | The International Corpus of English (ICE) drawn from English speakers from various places in the world, initiated by Greenbaum (1991). |
the_pile | language modeling | ? | ? | ? | English, code | The Pile corpus for measuring language model performance across various domains (Gao et al., 2020). |
twitter_aae | language modeling | ? | ? | ? | English (AAE-aligned and White-aligned) | The TwitterAAE corpus of Blodgett et al. (2016) for measuring language model performance in tweets as a function of speaker dialect. |
twitter_aae_aa | language modeling | ? | ? | ? | English (AAE-aligned) | The TwitterAAE corpus of Blodgett et al. (2016) for measuring language model performance on African-American-aligned Tweets. |
twitter_aae_white | language modeling | ? | ? | ? | English (White-aligned) | The TwitterAAE corpus of Blodgett et al. (2016) for measuring language model performance on White-aligned Tweets. |
blimp (Benchmark of Linguistic Minimal Pairs for English) | grammaticality | constructed minimal pair sentences | linguists | 2019 | English | The Benchmark of Linguistic Minimal Pairs for English (BLiMP) for measuring performance on linguistic phenomena using minimal pair design (Warstadt et al., 2020). |
wikifact | knowledge base completion | entity-relation-entity triples in natural language form | automatically generated from templates | ? | structured English | Scenario introduced in this work, inspired by Petroni et al. (2019), to more extensively test factual knowledge; a template sketch appears below the table. |
babi_qa | question answering | reasoning | synthetic | 2015 | English | The bAbI benchmark for measuring understanding and reasoning (Weston et al., 2015). |
dyck_language | next-word prediction | Dyck formal language | n/a | n/a | synthetic | Scenario testing hierarchical reasoning through the Dyck formal languages (Suzgun et al., 2019); a well-formedness check is sketched below the table. |
numeracy | next-word prediction | synthetically generated numeric functions | n/a | n/a | synthetic | Scenario introduced in this work to test numerical reasoning via symbolic regression; an example prompt format is sketched below the table. |
synthetic_reasoning (abstract symbols) | ? | n/a | n/a | n/a | synthetic | Synthetic reasoning tasks defined using abstract symbols based on LIME (Wu et al., 2021). |
synthetic_reasoning_natural (natural language) | ? | n/a | n/a | n/a | synthetic | Synthetic reasoning tasks defined using simple natural language based on LIME (Wu et al., 2021). |
gsm (GSM8K, grade school math word problems) | ? | n/a | n/a | n/a | English | The grade school math word problems dataset (GSM8K) for testing mathematical reasoning on grade-school math problems (Cobbe et al., 2021). |
math_regular | ? | n/a | n/a | n/a | English | The MATH benchmark for measuring mathematical problem solving on competition math problems (Hendrycks et al., 2021). |
math_chain_of_thought | ? | n/a | n/a | n/a | English | The MATH benchmark for measuring mathematical problem solving on competition math problems with chain-of-thought style reasoning (Hendrycks et al., 2021). |
code_apps | ? | n/a | n/a | n/a | English, code | The APPS benchmark for measuring competence on code challenges (Hendrycks et al., 2021). |
code_humaneval | ? | n/a | n/a | n/a | English, code | The HumanEval benchmark for measuring functional correctness for synthesizing programs from docstrings (Chen et al., 2021). |
legal_support | ? | n/a | n/a | n/a | English | Scenario introduced in this work to measure fine-grained legal reasoning through reverse entailment. |
lsat_qa | question answering | n/a | n/a | n/a | English | The LSAT benchmark for measuring analytical reasoning on the Law School Admission Test (LSAT; Zhong et al., 2021). |
lextreme | ? | ? | ? | ? | multilingual | The LEXTREME multilingual benchmark for legal natural language understanding (Niklaus et al., 2023). |
lex_glue | ? | ? | ? | ? | English | The LexGLUE benchmark dataset for legal language understanding in English (Chalkidis et al., 2022). |
billsum_legal_summarization | summarization | legal text from US bills | lawyers | ? | English | The BillSum benchmark for legal text summarization (Kornilova & Eidelman, 2019). |
multilexsum_legal_summarization | summarization | legal text from US civil rights lawsuits | lawyers | ? | English | The MultiLexSum benchmark for legal text summarization (Shen et al., 2022). |
eurlexsum_legal_summarization | summarization | legal text from EU legislation | lawyers | 1960 - 2020 | English | The EurLexSum benchmark for legal text summarization (Aumiller et al., 2022). |
entity_data_imputation | ? | n/a | n/a | n/a | synthetic | Scenario from Mei et al. (2021) that tests the ability to impute missing entities in a data table. |
entity_matching | ? | n/a | n/a | n/a | synthetic | Scenario from Magellan (Konda et al., 2016) that tests the ability to determine if two entities match; a prompt sketch appears below the table. |
copyright_text | ? | n/a | n/a | n/a | English | Scenario introduced in this work to measure copyright and memorization behavior for books, based on Carlini et al. (2021). |
copyright_code | ? | n/a | n/a | n/a | code | Scenario introduced in this work to measure copyright and memorization behavior for code, based on Carlini et al. (2021). |
disinformation_reiteration | ? | n/a | n/a | n/a | English | Scenario from Buchanan et al. (2021) that tests the ability to reiterate disinformation content. |
disinformation_wedging | ? | n/a | n/a | n/a | English | Scenario from Buchanan et al. (2021) that tests the ability to generate divisive and wedging content. |
bbq (Bias Benchmark for Question Answering) | question answering | n/a | n/a | n/a | English | The Bias Benchmark for Question Answering (BBQ) for measuring social bias in question answering in ambiguous and unambiguous contexts (Parrish et al., 2022). |
bold (Bias in Open-Ended Language Generation Dataset) | ? | n/a | n/a | n/a | English | The Bias in Open-Ended Language Generation Dataset (BOLD) for measuring biases and toxicity in open-ended language generation (Dhamala et al., 2021). |
real_toxicity_prompts | ? | n/a | n/a | n/a | English | The RealToxicityPrompts dataset for measuring toxicity in prompted model generations (Gehman et al., 2020). |
synthetic_efficiency | ? | n/a | n/a | n/a | synthetic | Scenario introduced in this work to better understand inference runtime performance of various models. |
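
The wikifact row above describes entity-relation-entity triples verbalized through templates. As a rough illustration (not the scenario's actual templates), the sketch below turns a (subject, relation, object) triple into a cloze-style prompt; the relation names, template wording, and example triple are assumptions made for this example.

```python
# Hypothetical illustration of verbalizing a (subject, relation, object) triple
# into a cloze-style prompt. Templates and the example triple are invented for
# illustration and are not the scenario's actual templates.
TEMPLATES = {
    "place_of_birth": "{subject} was born in",
    "author": "{subject} was written by",
}

def verbalize(subject: str, relation: str, obj: str) -> tuple[str, str]:
    """Return (prompt, expected completion) for one triple."""
    prompt = TEMPLATES[relation].format(subject=subject)
    return prompt, obj

prompt, answer = verbalize("The Old Man and the Sea", "author", "Ernest Hemingway")
print(prompt)   # "The Old Man and the Sea was written by"
print(answer)   # "Ernest Hemingway"
```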
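
The dyck_language scenario probes hierarchical structure by asking models to continue strings from Dyck languages of balanced brackets. For reference, here is a minimal well-formedness check for Dyck words; the three-bracket alphabet is an illustrative assumption and may differ from the alphabet the scenario actually uses.

```python
# Minimal Dyck-word check: every closing bracket must match the most recent
# unmatched opening bracket. The bracket inventory here is illustrative only.
PAIRS = {")": "(", "]": "[", "}": "{"}

def is_dyck_word(s: str) -> bool:
    """Return True iff the brackets in s are balanced and correctly nested."""
    stack = []
    for ch in s:
        if ch in PAIRS.values():      # opening bracket
            stack.append(ch)
        elif ch in PAIRS:             # closing bracket
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack                  # nothing left unclosed

assert is_dyck_word("([]{})")
assert not is_dyck_word("([)]")
```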
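
The numeracy scenario frames numerical reasoning as symbolic regression: the model sees a few input-output pairs from a hidden function and must produce the output at a new input. The sketch below builds one such example; the restriction to linear functions and the exact prompt layout are assumptions for illustration, not the scenario's actual configuration.

```python
import random

def make_numeracy_example(seed: int = 0) -> tuple[str, int]:
    """Sample a hidden linear function and build a small few-shot prompt."""
    rng = random.Random(seed)
    a, b = rng.randint(1, 9), rng.randint(0, 9)   # hidden function y = a*x + b
    xs = rng.sample(range(1, 20), 4)              # three shown points + one query
    shown = "\n".join(f"x = {x}, y = {a * x + b}" for x in xs[:3])
    query_x = xs[3]
    prompt = f"{shown}\nx = {query_x}, y ="
    return prompt, a * query_x + b

prompt, target = make_numeracy_example()
print(prompt)    # three worked (x, y) pairs followed by the query line
print(target)    # the integer the model is expected to produce
```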
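
The entity_matching scenario asks whether two records refer to the same real-world entity. One plausible way to pose this to a language model is sketched below; the field names, record contents, and prompt wording are assumptions for illustration and do not reflect the Magellan datasets' actual schemas.

```python
def entity_matching_prompt(record_a: dict, record_b: dict) -> str:
    """Serialize two records and ask for a Yes/No same-entity judgement."""
    def serialize(record: dict) -> str:
        return "; ".join(f"{key}: {value}" for key, value in record.items())
    return (
        f"Record A: {serialize(record_a)}\n"
        f"Record B: {serialize(record_b)}\n"
        "Do Record A and Record B refer to the same entity? Answer Yes or No:"
    )

print(entity_matching_prompt(
    {"title": "iPhone 13 128GB", "brand": "Apple"},
    {"title": "Apple iPhone 13 (128 GB)", "brand": "Apple"},
))
```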