Scenarios

Each scenario below lists its name, its identifier, its task, what the data is, who produced it, when it was produced, and its language, followed by a short description.

BoolQ

boolq

Task: question answering | What: passages from Wikipedia, questions from search queries | Who: web users | When: 2010s | Language: English

The BoolQ benchmark for binary (yes/no) question answering (Clark et al., 2019).

NarrativeQA

narrative_qa

Task: question answering | What: passages are books and movie scripts, questions are unknown | Who: ? | When: ? | Language: English

The NarrativeQA benchmark for reading comprehension over narratives (Kočiský et al., 2017).

NaturalQuestions (closed-book)

natural_qa_closedbook

Task: question answering | What: passages from Wikipedia, questions from search queries | Who: web users | When: 2010s | Language: English

The NaturalQuestions (Kwiatkowski et al., 2019) benchmark for question answering based on naturally-occurring queries through Google Search. The input does not include the Wikipedia page with the answer.

NaturalQuestions (open-book)

natural_qa_openbook_longans

Task: question answering | What: passages from Wikipedia, questions from search queries | Who: web users | When: 2010s | Language: English

The NaturalQuestions (Kwiatkowski et al., 2019) benchmark for question answering based on naturally-occurring queries through Google Search. The input includes the Wikipedia page with the answer.

QuAC (Question Answering in Context)

quac

Task: question answering | What: ? | Who: ? | When: ? | Language: English

The QuAC benchmark for question answering in the context of dialogues (Choi et al., 2018).

HellaSwag

hellaswag

Task: question answering | What: commonsense reasoning | Who: ? | When: ? | Language: English

The HellaSwag benchmark for commonsense reasoning in question answering (Zellers et al., 2019).

OpenbookQA

openbookqa

Task: question answering | What: ? | Who: ? | When: ? | Language: English

The OpenbookQA benchmark for commonsense-intensive open book question answering (Mihaylov et al., 2018).

TruthfulQA

truthful_qa

Task: question answering | What: ? | Who: ? | When: ? | Language: English

The TruthfulQA benchmark for measuring model truthfulness and commonsense knowledge in question answering (Lin et al., 2022).

MMLU (Massive Multitask Language Understanding)

mmlu

Task: question answering | What: ? | Who: ? | When: ? | Language: English

The Massive Multitask Language Understanding (MMLU) benchmark for knowledge-intensive question answering across 57 domains (Hendrycks et al., 2021).

MS MARCO (regular track)

msmarco_regular

Task: information retrieval | What: ? | Who: ? | When: ? | Language: English

The MS MARCO benchmark's regular track for passage retrieval in information retrieval (https://microsoft.github.io/msmarco/).

MS MARCO (TREC track)

msmarco_trec

Task: information retrieval | What: ? | Who: ? | When: ? | Language: English

The MS MARCO benchmark's TREC Deep Learning track for passage retrieval in information retrieval (https://trec.nist.gov).

CNN/DailyMail

summarization_cnndm

Task: summarization | What: ? | Who: ? | When: ? | Language: English

The CNN/DailyMail benchmark for text summarization (Hermann et al., 2015; Nallapati et al., 2016).

XSUM

summarization_xsum

Task: summarization | What: ? | Who: ? | When: ? | Language: English

The XSUM benchmark for text summarization of BBC news articles (Narayan et al., 2018).

IMDB

imdb

Task: sentiment analysis | What: movie reviews | Who: ? | When: ? | Language: English

The IMDB benchmark for sentiment analysis of movie reviews (Maas et al., 2011).

RAFT (Real-world Annotated Few-Shot)

raft

Task: text classification | What: ? | Who: ? | When: ? | Language: English

The Real-world annotated few-shot (RAFT) meta-benchmark of 11 real-world text classification tasks (Alex et al., 2021).

CivilComments

civil_comments

Task: toxicity classification | What: ? | Who: ? | When: ? | Language: English

The CivilComments benchmark for toxicity detection (Borkan et al., 2019).

ICE (International Corpus of English)

ice

Task: language modeling | What: ? | Who: ? | When: ? | Language: English varieties from different nations

The International Corpus of English (ICE), drawn from English speakers in various parts of the world, initiated by Greenbaum (1991).

The Pile

the_pile

Task: language modeling | What: ? | Who: ? | When: ? | Language: English, code

The Pile corpus for measuring language model performance across various domains (Gao et al., 2020).

TwitterAAE

twitter_aae

Task: language modeling | What: ? | Who: ? | When: ? | Language: English (AAE-aligned and White-aligned)

The TwitterAAE corpus of Blodgett et al. (2016) for measuring language model performance in tweets as a function of speaker dialect.

TwitterAAE (AA)

twitter_aae_aa

Task: language modeling | What: ? | Who: ? | When: ? | Language: English (AAE-aligned)

The TwitterAAE corpus of Blodgett et al. (2016) for measuring language model performance on African-American-aligned Tweets.

TwitterAAE (White)

twitter_aae_white

Task: language modeling | What: ? | Who: ? | When: ? | Language: English (White-aligned)

The TwitterAAE corpus of Blodgett et al. (2016) for measuring language model performance on White-aligned Tweets.

BLiMP (The Benchmark of Linguistic Minimal Pairs for English)

blimp

Task: grammaticality | What: constructed minimal pair sentences | Who: linguists | When: 2019 | Language: English

The Benchmark of Linguistic Minimal Pairs for English (BLiMP) for measuring performance on linguistic phenomena using minimal pair design (Warstadt et al., 2020).

WikiFact

wikifact

Task: knowledge base completion | What: entity-relation-entity triples in natural language form | Who: automatically generated from templates | When: ? | Language: structured English

Scenario introduced in this work, inspired by Petroni et al. (2019), to more extensively test factual knowledge.
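
As an illustration of the format (the template and relation names below are hypothetical, not the scenario's actual prompts), each instance verbalizes an entity-relation-entity triple so that the model must complete the object entity:

    # Hypothetical verbalization of a knowledge-base triple into a completion prompt.
    # Template wording is illustrative; the scenario's own templates may differ.
    TEMPLATES = {
        "capital_of": "The capital of {subject} is",
        "author_of": "The author of {subject} is",
    }

    def verbalize(subject: str, relation: str, obj: str):
        prompt = TEMPLATES[relation].format(subject=subject)
        return prompt, obj  # the model's completion is scored against obj

    print(verbalize("France", "capital_of", "Paris"))
    # ('The capital of France is', 'Paris')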

bAbI

babi_qa

Task: question answering | What: reasoning | Who: synthetic | When: 2015 | Language: English

The bAbI benchmark for measuring understanding and reasoning (Weston et al., 2015).

Dyck

dyck_language

Task: next-word prediction | What: Dyck formal language | Who: n/a | When: n/a | Language: synthetic

Scenario testing hierarchical reasoning through the Dyck formal languages (Suzgun et al., 2019).
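
For reference, a Dyck word is a sequence of brackets of several types in which every opening bracket is closed by a matching bracket in last-in-first-out order; a model must track this nesting to predict valid continuations. A minimal membership check (illustrative only, not part of the scenario code):

    # Check whether a bracket string is a well-formed Dyck word:
    # every closing bracket must match the most recent unclosed opening bracket.
    PAIRS = {")": "(", "]": "[", "}": "{"}

    def is_dyck_word(s: str) -> bool:
        stack = []
        for ch in s:
            if ch in "([{":
                stack.append(ch)
            elif ch in PAIRS:
                if not stack or stack.pop() != PAIRS[ch]:
                    return False
            else:
                return False  # non-bracket symbol
        return not stack  # every opened bracket must be closed

    assert is_dyck_word("([]{[]})")
    assert not is_dyck_word("([)]")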

Numerical reasoning

numeracy

Task: next-word prediction | What: ? | Who: n/a | When: n/a | Language: synthetic

Scenario introduced in this work to test numerical reasoning via symbolic regression.
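
As a rough illustration (the function class and prompt wording here are assumptions, not the scenario's actual format), a symbolic-regression-style instance reveals input-output pairs of a hidden function and asks for the output on a new input:

    # Illustrative numeracy instance: show (x, f(x)) pairs for a hidden function
    # and ask the model to continue the pattern on a new input.
    def make_instance(a=2, b=3, xs=(1, 2, 3), query_x=4):
        # hidden linear function f(x) = a*x + b (an assumed example class)
        examples = [f"f({x}) = {a * x + b}" for x in xs]
        prompt = ", ".join(examples) + f", f({query_x}) ="
        return prompt, a * query_x + b

    prompt, answer = make_instance()
    print(prompt)   # f(1) = 5, f(2) = 7, f(3) = 9, f(4) =
    print(answer)   # 11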

Synthetic reasoning (abstract symbols)

synthetic_reasoning

Task: ? | What: n/a | Who: n/a | When: n/a | Language: synthetic

Synthetic reasoning tasks defined using abstract symbols based on LIME (Wu et al., 2021).

Synthetic reasoning (natural language)

synthetic_reasoning_natural

Task: ? | What: n/a | Who: n/a | When: n/a | Language: synthetic

Synthetic reasoning tasks defined using simple natural language based on LIME (Wu et al., 2021).

GSM8K (Grade school math word problems)

gsm

Task: ? | What: n/a | Who: n/a | When: n/a | Language: synthetic

The grade school math word problems dataset (GSM8K) for testing mathematical reasoning on grade-school math problems (Cobbe et al., 2021).

MATH

math_regular

Task: ? | What: n/a | Who: n/a | When: n/a | Language: synthetic

The MATH benchmark for measuring mathematical problem solving on competition math problems (Hendrycks et al., 2021).

MATH (chain-of-thought)

math_chain_of_thought

Task: ? | What: n/a | Who: n/a | When: n/a | Language: synthetic

The MATH benchmark for measuring mathematical problem solving on competition math problems with chain-of-thought-style reasoning (Hendrycks et al., 2021).

APPS (Code)

code_apps

Task: ? | What: n/a | Who: n/a | When: n/a | Language: synthetic

The APPS benchmark for measuring competence on code challenges (Hendrycks et al., 2021).

HumanEval (Code)

code_humaneval

Task: ? | What: n/a | Who: n/a | When: n/a | Language: synthetic

The HumanEval benchmark for measuring functional correctness for synthesizing programs from docstrings (Chen et al., 2021).

LegalSupport

legal_support

Task: ? | What: n/a | Who: n/a | When: n/a | Language: synthetic

Scenario introduced in this work to measure fine-grained legal reasoning through reverse entailment.

LSAT

lsat_qa

Task: ? | What: n/a | Who: n/a | When: n/a | Language: synthetic

The LSAT benchmark for measuring analytical reasoning on the Law School Admission Test (LSAT; Zhong et al., 2021).

LEXTREME

lextreme

The LEXTREME benchmark for multilingual legal natural language understanding.

LexGLUE

lex_glue

The LexGLUE benchmark dataset for legal language understanding in English.

BillSum

billsum_legal_summarization

Task: summarization | What: legal text from US bills | Who: lawyers | Language: English

The BillSum benchmark for legal text summarization (Kornilova & Eidelman, 2019).

MultiLexSum

multilexsum_legal_summarization

Task: summarization | What: legal text from US civil rights lawsuits | Who: lawyers | Language: English

The MultiLexSum benchmark for legal text summarization (Shen et al., 2022).

EurLexSum

eurlexsum_legal_summarization

Task: summarization | What: legal text from EU legislation | Who: lawyers | When: 1960-2020 | Language: English

The EurLexSum benchmark for legal text summarization (Aumiller et al., 2022).

Data imputation

entity_data_imputation

Task: ? | What: n/a | Who: n/a | When: n/a | Language: synthetic

Scenario from Mei et al. (2021) that tests the ability to impute missing entities in a data table.
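
As a rough illustration (field names and serialization are hypothetical), an imputation instance can be framed as completing the missing cell of a serialized table row:

    # Hypothetical serialization of a table row with a missing cell,
    # posed as a text-completion instance.
    def imputation_prompt(row: dict, target: str) -> str:
        known = "; ".join(f"{k}: {v}" for k, v in row.items()
                          if k != target and v is not None)
        return f"{known}; {target}:"

    print(imputation_prompt({"name": "iPhone 12", "brand": "Apple", "category": None},
                            target="category"))
    # name: iPhone 12; brand: Apple; category: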

Entity matching

entity_matching

Task: ? | What: n/a | Who: n/a | When: n/a | Language: synthetic

Scenario from Magellan (Konda et al., 2016) that tests the ability to determine if two entities match.
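
As a rough illustration (record fields and prompt wording are hypothetical), entity matching can be posed to a language model as a yes/no question over two serialized records:

    # Hypothetical yes/no prompt asking whether two records describe the same entity.
    def match_prompt(record_a: dict, record_b: dict) -> str:
        serialize = lambda r: ", ".join(f"{k}: {v}" for k, v in r.items())
        return (f"Record A: {serialize(record_a)}\n"
                f"Record B: {serialize(record_b)}\n"
                "Do Record A and Record B refer to the same entity? Yes or No:")

    print(match_prompt({"title": "Dell XPS 13 Laptop", "year": 2020},
                       {"title": "XPS 13 notebook by Dell", "year": 2020}))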

Copyright (text)

copyright_text

Task: ? | What: n/a | Who: n/a | When: n/a | Language: synthetic

Scenario introduced in this work to measure copyright and memorization behavior for books, based on Carlini et al. (2021).
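
As a rough sketch of how such memorization can be quantified (one simple overlap measure chosen for illustration; the scenario's actual metrics may differ), compare the model's continuation of a copyrighted prefix against the true continuation; the same idea applies to the code variant below:

    # One simple memorization measure: length of the longest common prefix
    # (in whitespace tokens) between the model's continuation and the true text.
    def longest_common_prefix_len(generated: str, reference: str) -> int:
        matched = 0
        for g, r in zip(generated.split(), reference.split()):
            if g != r:
                break
            matched += 1
        return matched

    print(longest_common_prefix_len(
        "it was the best of times it was the",
        "it was the best of times it was the worst of times"))
    # 9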

Copyright (code)

copyright_code

Task: ? | What: n/a | Who: n/a | When: n/a | Language: synthetic

Scenario introduced in this work to measure copyright and memorization behavior for code, based on Carlini et al. (2021).

Disinformation (reiteration)

disinformation_reiteration

Task: ? | What: n/a | Who: n/a | When: n/a | Language: synthetic

Scenario from Buchanan et al. (2021) that tests the ability to reiterate disinformation content.

Disinformation (wedging)

disinformation_wedging

Task: ? | What: n/a | Who: n/a | When: n/a | Language: synthetic

Scenario from Buchanan et al. (2021) that tests the ability to generate divisive and wedging content.

BBQ (Bias Benchmark for Question Answering)

bbq

Task: ? | What: n/a | Who: n/a | When: n/a | Language: synthetic

The Bias Benchmark for Question Answering (BBQ) for measuring social bias in question answering in ambiguous and unambiguous contexts (Parrish et al., 2022).

BOLD (Bias in Open-Ended Language Generation Dataset)

bold

Task: ? | What: n/a | Who: n/a | When: n/a | Language: synthetic

The Bias in Open-Ended Language Generation Dataset (BOLD) for measuring biases and toxicity in open-ended language generation (Dhamala et al., 2021).

RealToxicityPrompts

real_toxicity_prompts

Task: ? | What: n/a | Who: n/a | When: n/a | Language: synthetic

The RealToxicityPrompts dataset for measuring toxicity in prompted model generations (Gehman et al., 2020).

Synthetic efficiency

synthetic_efficiency

Task: ? | What: n/a | Who: n/a | When: n/a | Language: synthetic

Scenario introduced in this work to better understand inference runtime performance of various models.