Scenarios
Scenario | Task | What | Who | When | Language | Description |
---|---|---|---|---|---|---|
boolq | question answering | passages from Wikipedia, questions from search queries | web users | 2010s | English | The BoolQ benchmark for binary (yes/no) question answering (Clark et al., 2019). |
narrative_qa | question answering | passages are books and movie scripts; question source is unknown | ? | ? | English | The NarrativeQA benchmark for reading comprehension over narratives (Kočiský et al., 2017). |
natural_qa_closedbook (NaturalQuestions, closed-book) | question answering | passages from Wikipedia, questions from search queries | web users | 2010s | English | The NaturalQuestions (Kwiatkowski et al., 2019) benchmark for question answering based on naturally-occurring queries through Google Search. The input does not include the Wikipedia page with the answer. |
natural_qa_openbook_longans (NaturalQuestions, open-book) | question answering | passages from Wikipedia, questions from search queries | web users | 2010s | English | The NaturalQuestions (Kwiatkowski et al., 2019) benchmark for question answering based on naturally-occurring queries through Google Search. The input includes the Wikipedia page with the answer. |
quac (Question Answering in Context) | question answering | ? | ? | ? | English | The QuAC benchmark for question answering in the context of dialogues (Choi et al., 2018). |
hellaswag | question answering | commonsense reasoning | ? | ? | English | The HellaSwag benchmark for commonsense reasoning in question answering (Zellers et al., 2019). |
openbookqa | question answering | ? | ? | ? | English | The OpenbookQA benchmark for commonsense-intensive open book question answering (Mihaylov et al., 2018). |
truthful_qa | question answering | ? | ? | ? | English | The TruthfulQA benchmark for measuring model truthfulness and commonsense knowledge in question answering (Lin et al., 2022). |
mmlu (Massive Multitask Language Understanding) | question answering | ? | ? | ? | English | The Massive Multitask Language Understanding (MMLU) benchmark for knowledge-intensive question answering across 57 domains (Hendrycks et al., 2021). |
msmarco_regular | information retrieval | ? | ? | ? | English | The MS MARCO benchmark's regular track for passage retrieval in information retrieval (https://microsoft.github.io/msmarco/). |
msmarco_trec | information retrieval | ? | ? | ? | English | The MS MARCO benchmark's deep learning TREC track for passage retrieval in information retrieval (https://trec.nist.gov). |
summarization_cnndm | summarization | ? | ? | ? | English | The CNN/DailyMail benchmark for text summarization (Hermann et al., 2015; Nallapati et al., 2016). |
summarization_xsum | summarization | ? | ? | ? | English | The XSUM benchmark for text summarization of BBC news articles (Narayan et al., 2018). |
imdb | sentiment analysis | movie reviews | ? | ? | English | The IMDB benchmark for sentiment analysis of movie reviews (Maas et al., 2011). |
raft (Real-world Annotated Few-shot Tasks) | text classification | ? | ? | ? | English | The Real-world Annotated Few-shot Tasks (RAFT) meta-benchmark of 11 real-world text classification tasks (Alex et al., 2021). |
civil_comments | toxicity classification | ? | ? | ? | English | The CivilComments benchmark for toxicity detection (Borkan et al., 2019). |
ice (International Corpus of English) | language modeling | ? | ? | ? | English varieties from different nations | The International Corpus of English (ICE) drawn from English speakers from various places in the world, initiated by Greenbaum (1991). |
the_pile | language modeling | ? | ? | ? | English, code | The Pile corpus for measuring language model performance across various domains (Gao et al., 2020). |
twitter_aae | language modeling | ? | ? | ? | English (AAE-aligned and White-aligned) | The TwitterAAE corpus of Blodgett et al. (2016) for measuring language model performance in tweets as a function of speaker dialect. |
twitter_aae_aa | language modeling | ? | ? | ? | English (AAE-aligned) | The TwitterAAE corpus of Blodgett et al. (2016) for measuring language model performance on African-American-aligned Tweets. |
twitter_aae_white | language modeling | ? | ? | ? | English (White-aligned) | The TwitterAAE corpus of Blodgett et al. (2016) for measuring language model performance on White-aligned Tweets. |
blimp (Benchmark of Linguistic Minimal Pairs for English) | grammaticality | constructed minimal pair sentences | linguists | 2019 | English | The Benchmark of Linguistic Minimal Pairs for English (BLiMP) for measuring performance on linguistic phenomena using minimal pair design (Warstadt et al., 2020). |
wikifact | knowledge base completion | entity-relation-entity triples in natural language form | automatically generated from templates | ? | structured English | Scenario introduced in this work, inspired by Petroni et al. (2019), to more extensively test factual knowledge; a template sketch appears below the table. |
babi_qa | question answering | reasoning | synthetic | 2015 | English | The bAbI benchmark for measuring understanding and reasoning (Weston et al., 2015). |
dyck_language | next-word prediction | Dyck formal language | n/a | n/a | synthetic | Scenario testing hierarchical reasoning through the Dyck formal languages (Suzgun et al., 2019); a well-formedness check is sketched below the table. |
numeracy | next-word prediction | synthetically generated numeric functions | n/a | n/a | synthetic | Scenario introduced in this work to test numerical reasoning via symbolic regression; an example prompt format is sketched below the table. |
synthetic_reasoning (abstract symbols) | ? | n/a | n/a | n/a | synthetic | Synthetic reasoning tasks defined using abstract symbols based on LIME (Wu et al., 2021). |
synthetic_reasoning_natural (natural language) | ? | n/a | n/a | n/a | synthetic | Synthetic reasoning tasks defined using simple natural language based on LIME (Wu et al., 2021). |
gsm (GSM8K, grade school math word problems) | ? | n/a | n/a | n/a | English | The grade school math word problems dataset (GSM8K) for testing mathematical reasoning on grade-school math problems (Cobbe et al., 2021). |
math_regular | ? | n/a | n/a | n/a | English | The MATH benchmark for measuring mathematical problem solving on competition math problems (Hendrycks et al., 2021). |
math_chain_of_thought | ? | n/a | n/a | n/a | English | The MATH benchmark for measuring mathematical problem solving on competition math problems with chain-of-thought style reasoning (Hendrycks et al., 2021). |
code_apps | ? | n/a | n/a | n/a | English, code | The APPS benchmark for measuring competence on code challenges (Hendrycks et al., 2021). |
code_humaneval | ? | n/a | n/a | n/a | English, code | The HumanEval benchmark for measuring functional correctness for synthesizing programs from docstrings (Chen et al., 2021). |
legal_support | ? | n/a | n/a | n/a | English | Scenario introduced in this work to measure fine-grained legal reasoning through reverse entailment. |
lsat_qa | question answering | n/a | n/a | n/a | English | The LSAT benchmark for measuring analytical reasoning on the Law School Admission Test (LSAT; Zhong et al., 2021). |
lextreme | ? | ? | ? | ? | multilingual | The LEXTREME multilingual benchmark for legal natural language understanding (Niklaus et al., 2023). |
lex_glue | ? | ? | ? | ? | English | The LexGLUE benchmark dataset for legal language understanding in English (Chalkidis et al., 2022). |
billsum_legal_summarization | summarization | legal text from US bills | lawyers | ? | English | The BillSum benchmark for legal text summarization (Kornilova & Eidelman, 2019). |
multilexsum_legal_summarization | summarization | legal text from US civil rights lawsuits | lawyers | ? | English | The MultiLexSum benchmark for legal text summarization (Shen et al., 2022). |
eurlexsum_legal_summarization | summarization | legal text from EU legislation | lawyers | 1960 - 2020 | English | The EurLexSum benchmark for legal text summarization (Aumiller et al., 2022). |
entity_data_imputation | ? | n/a | n/a | n/a | synthetic | Scenario from Mei et al. (2021) that tests the ability to impute missing entities in a data table. |
entity_matching | ? | n/a | n/a | n/a | synthetic | Scenario from Magellan (Konda et al., 2016) that tests the ability to determine if two entities match; a prompt sketch appears below the table. |
copyright_text | ? | n/a | n/a | n/a | English | Scenario introduced in this work to measure copyright and memorization behavior for books, based on Carlini et al. (2021). |
copyright_code | ? | n/a | n/a | n/a | code | Scenario introduced in this work to measure copyright and memorization behavior for code, based on Carlini et al. (2021). |
disinformation_reiteration | ? | n/a | n/a | n/a | English | Scenario from Buchanan et al. (2021) that tests the ability to reiterate disinformation content. |
disinformation_wedging | ? | n/a | n/a | n/a | English | Scenario from Buchanan et al. (2021) that tests the ability to generate divisive and wedging content. |
bbq (Bias Benchmark for Question Answering) | question answering | n/a | n/a | n/a | English | The Bias Benchmark for Question Answering (BBQ) for measuring social bias in question answering in ambiguous and unambiguous contexts (Parrish et al., 2022). |
bold (Bias in Open-Ended Language Generation Dataset) | ? | n/a | n/a | n/a | English | The Bias in Open-Ended Language Generation Dataset (BOLD) for measuring biases and toxicity in open-ended language generation (Dhamala et al., 2021). |
real_toxicity_prompts | ? | n/a | n/a | n/a | English | The RealToxicityPrompts dataset for measuring toxicity in prompted model generations (Gehman et al., 2020). |
synthetic_efficiency | ? | n/a | n/a | n/a | synthetic | Scenario introduced in this work to better understand inference runtime performance of various models. |
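
The wikifact row above describes entity-relation-entity triples verbalized through templates. As a rough illustration (not the scenario's actual templates), the sketch below turns a (subject, relation, object) triple into a cloze-style prompt; the relation names, template wording, and example triple are assumptions made for this example.

```python
# Hypothetical illustration of verbalizing a (subject, relation, object) triple
# into a cloze-style prompt. Templates and the example triple are invented for
# illustration and are not the scenario's actual templates.
TEMPLATES = {
    "place_of_birth": "{subject} was born in",
    "author": "{subject} was written by",
}

def verbalize(subject: str, relation: str, obj: str) -> tuple[str, str]:
    """Return (prompt, expected completion) for one triple."""
    prompt = TEMPLATES[relation].format(subject=subject)
    return prompt, obj

prompt, answer = verbalize("The Old Man and the Sea", "author", "Ernest Hemingway")
print(prompt)   # "The Old Man and the Sea was written by"
print(answer)   # "Ernest Hemingway"
```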
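
The dyck_language scenario probes hierarchical structure by asking models to continue strings from Dyck languages of balanced brackets. For reference, here is a minimal well-formedness check for Dyck words; the three-bracket alphabet is an illustrative assumption and may differ from the alphabet the scenario actually uses.

```python
# Minimal Dyck-word check: every closing bracket must match the most recent
# unmatched opening bracket. The bracket inventory here is illustrative only.
PAIRS = {")": "(", "]": "[", "}": "{"}

def is_dyck_word(s: str) -> bool:
    """Return True iff the brackets in s are balanced and correctly nested."""
    stack = []
    for ch in s:
        if ch in PAIRS.values():      # opening bracket
            stack.append(ch)
        elif ch in PAIRS:             # closing bracket
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack                  # nothing left unclosed

assert is_dyck_word("([]{})")
assert not is_dyck_word("([)]")
```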
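
The numeracy scenario frames numerical reasoning as symbolic regression: the model sees a few input-output pairs from a hidden function and must produce the output at a new input. The sketch below builds one such example; the restriction to linear functions and the exact prompt layout are assumptions for illustration, not the scenario's actual configuration.

```python
import random

def make_numeracy_example(seed: int = 0) -> tuple[str, int]:
    """Sample a hidden linear function and build a small few-shot prompt."""
    rng = random.Random(seed)
    a, b = rng.randint(1, 9), rng.randint(0, 9)   # hidden function y = a*x + b
    xs = rng.sample(range(1, 20), 4)              # three shown points + one query
    shown = "\n".join(f"x = {x}, y = {a * x + b}" for x in xs[:3])
    query_x = xs[3]
    prompt = f"{shown}\nx = {query_x}, y ="
    return prompt, a * query_x + b

prompt, target = make_numeracy_example()
print(prompt)    # three worked (x, y) pairs followed by the query line
print(target)    # the integer the model is expected to produce
```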
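
The entity_matching scenario asks whether two records refer to the same real-world entity. One plausible way to pose this to a language model is sketched below; the field names, record contents, and prompt wording are assumptions for illustration and do not reflect the Magellan datasets' actual schemas.

```python
def entity_matching_prompt(record_a: dict, record_b: dict) -> str:
    """Serialize two records and ask for a Yes/No same-entity judgement."""
    def serialize(record: dict) -> str:
        return "; ".join(f"{key}: {value}" for key, value in record.items())
    return (
        f"Record A: {serialize(record_a)}\n"
        f"Record B: {serialize(record_b)}\n"
        "Do Record A and Record B refer to the same entity? Answer Yes or No:"
    )

print(entity_matching_prompt(
    {"title": "iPhone 13 128GB", "brand": "Apple"},
    {"title": "Apple iPhone 13 (128 GB)", "brand": "Apple"},
))
```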