BoolQ | The BoolQ benchmark for binary (yes/no) question answering [(Clark et al., 2019)](https://aclanthology.org/N19-1300/). | generation | 1000.000 | 1.000 | 1672542.565 | 4038.667 | 44.000 |
NarrativeQA | The NarrativeQA benchmark for reading comprehension over narratives [(Kočiský et al., 2018)](https://aclanthology.org/Q18-1023/). | generation | 235.000 | 2.000 | 7093437.430 | 80807.332 | 44.000 |
NaturalQuestions (closed-book) | The NaturalQuestions [(Kwiatkowski et al., 2019)](https://aclanthology.org/Q19-1026/) benchmark for question answering based on naturally occurring queries issued to Google Search. The input does not include the Wikipedia page containing the answer. | generation | 575.200 | 1.951 | 216416.561 | 105429.757 | 44.000 |
NaturalQuestions (open-book) | The NaturalQuestions [(Kwiatkowski et al., 2019)](https://aclanthology.org/Q19-1026/) benchmark for question answering based on naturally occurring queries issued to Google Search. The input includes the Wikipedia page containing the answer. | generation | 1000.000 | 2.029 | 2761054.496 | 108818.390 | 44.000 |
QuAC (Question Answering in Context) | The QuAC benchmark for question answering in the context of dialogues [(Choi et al., 2018)](https://aclanthology.org/D18-1241/). | generation | 864.200 | 3.653 | 3580198.417 | 76726.607 | 44.000 |
HellaSwag | The HellaSwag benchmark for commonsense reasoning in question answering [(Zellers et al., 2019)](https://aclanthology.org/P19-1472/). | multiple_choice_joint, multiple_choice_separate_original, multiple_choice_separate_calibrated | 615.200 | 4.000 | 38029.630 | 36827.809 | 33.000 |
OpenbookQA | The OpenbookQA benchmark for commonsense-intensive open book question answering [(Mihaylov et al., 2018)](https://aclanthology.org/D18-1260/). | multiple_choice_joint, multiple_choice_separate_original, multiple_choice_separate_calibrated | 333.750 | 4.000 | 1914.016 | 1902.622 | 33.000 |
TruthfulQA | The TruthfulQA benchmark for measuring model truthfulness and commonsense knowledge in question answering [(Lin et al., 2022)](https://aclanthology.org/2022.acl-long.229/). | multiple_choice_joint, multiple_choice_separate_original, multiple_choice_separate_calibrated | 383.576 | 4.892 | 945086.795 | 1983.367 | 44.000 |
MMLU (Massive Multitask Language Understanding) | The Massive Multitask Language Understanding (MMLU) benchmark for knowledge-intensive question answering across 57 domains [(Hendrycks et al., 2021)](https://openreview.net/forum?id=d7KBjmI3GmQ). | multiple_choice_joint, multiple_choice_separate_original, multiple_choice_separate_calibrated | 49.727 | 4.000 | 5975052.037 | 12876.705 | 44.000 |
MS MARCO (regular track) | The MS MARCO benchmark's regular track for passage retrieval in information retrieval [(https://microsoft.github.io/msmarco/)](https://microsoft.github.io/msmarco/). | ranking_binary | 480.800 | 30.482 | 700258.555 | 2786.000 | 32.000 |
MS MARCO (TREC track) | The MS MARCO benchmark's TREC Deep Learning track for passage retrieval in information retrieval [(https://microsoft.github.io/msmarco/)](https://microsoft.github.io/msmarco/). | ranking_binary | 33.333 | 230.013 | 424303.513 | 1685.316 | 33.000 |
CNN/DailyMail | The CNN/DailyMail benchmark for text summarization ([Hermann et al., 2015](https://papers.nips.cc/paper/2015/hash/afdec7005cc9f14302cd0474fd0f3c96-Abstract.html); [Nallapati et al., 2016](https://aclanthology.org/K16-1028/)). | generation | 500.000 | 1.000 | 2265341.228 | 126902.819 | 44.000 |
XSUM | The XSUM benchmark for text summarization of BBC news articles [(Narayan et al., 2018)](https://aclanthology.org/D18-1206/). | generation | 500.000 | 1.000 | 2200682.056 | 47596.685 | 44.000 |
IMDB | The IMDB benchmark for sentiment analysis of movie reviews [(Maas et al., 2011)](https://aclanthology.org/P11-1015/). | generation | 743.115 | 1.000 | 2851833.209 | 4035.539 | 44.000 |
RAFT (Real-world Annotated Few-Shot) | The Real-world Annotated Few-Shot (RAFT) meta-benchmark of 11 real-world text classification tasks [(Alex et al., 2021)](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/ca46c1b9512a7a8315fa3c5a946e8265-Abstract-round2.html). | generation | 32.878 | 1.000 | 13799633.953 | 100921.380 | 44.000 |
CivilComments | The CivilComments benchmark for toxicity detection [(Borkan et al., 2019)](https://arxiv.org/pdf/1903.04561.pdf). | generation | 250.000 | 1.000 | 47990259.024 | 143001.569 | 44.000 |
ICE (International Corpus of English) | The International Corpus of English (ICE), drawn from English speakers in various parts of the world, initiated by [Greenbaum (1991)](https://www.cambridge.org/core/journals/english-today/article/abs/ice-the-international-corpus-of-english/47808205394C538393C3FD8E62E5E701). | language_modeling | 489.833 | 0.000 | 433348.637 | 419973.750 | 33.000 |
The Pile | The Pile corpus for measuring language model performance across various domains [(Gao et al., 2020)](https://arxiv.org/pdf/2101.00027.pdf). | language_modeling | 476.092 | 0.000 | 279433.102 | 265327.096 | 33.000 |
TwitterAAE | The TwitterAAE corpus of [Blodgett et al. (2016)](https://aclanthology.org/D16-1120/) for measuring language model performance in tweets as a function of speaker dialect. | language_modeling | 1000.000 | 0.000 | 1026.181 | 1004.371 | 33.000 |
TwitterAAE (AA) | The TwitterAAE corpus of [Blodgett et al. (2016)](https://aclanthology.org/D16-1120/) for measuring language model performance on African-American-aligned Tweets. | language_modeling | 1000.000 | 0.000 | 467.948 | 457.978 | 33.000 |
TwitterAAE (white) | The TwitterAAE corpus of [Blodgett et al. (2016)](https://aclanthology.org/D16-1120/) for measuring language model performance on White-aligned Tweets. | language_modeling | 1000.000 | 0.000 | 558.233 | 546.393 | 33.000 |
BLiMP (The Benchmark of Linguistic Minimal Pairs for English) | The Benchmark of Linguistic Minimal Pairs for English (BLiMP) for measuring performance on linguistic phenomena using minimal pair design [(Warstadt et al., 2020)](https://aclanthology.org/2020.tacl-1.25/). | multiple_choice_joint, multiple_choice_separate_original, multiple_choice_separate_calibrated | 1000.000 | 2.000 | 3284.730 | 3218.811 | 33.000 |
WikiFact | Scenario introduced in this work, inspired by [Petroni et al. (2019)](https://aclanthology.org/D19-1250/), to more extensively test factual knowledge. | generation | 396.350 | 8.372 | 821113.236 | 188851.010 | 44.000 |
bAbI | The bAbI benchmark for measuring understanding and reasoning [(Weston et al., 2015)](https://arxiv.org/pdf/1502.05698.pdf). | generation | 500.000 | 1.000 | 2459660.738 | 7568.021 | 46.000 |
Dyck | Scenario testing hierarchical reasoning through the Dyck formal languages [(Suzgun et al., 2019)](https://aclanthology.org/W19-3905/). | generation | 500.000 | 1.000 | 72808.938 | 1367.694 | 46.000 |
Synthetic reasoning (abstract symbols) | Synthetic reasoning tasks defined using abstract symbols based on LIME [(Wu et al., 2021)](https://proceedings.mlr.press/v139/wu21c.html). | generation | 500.000 | 1.000 | 1153444.867 | 38203.291 | 46.000 |
Synthetic reasoning (natural language) | Synthetic reasoning tasks defined using simple natural language based on LIME [(Wu et al., 2021)](https://proceedings.mlr.press/v139/wu21c.html). | generation | 500.000 | 1.000 | 633048.741 | 9345.787 | 46.000 |
GSM8K (Grade school math word problems) | The GSM8K dataset of grade-school math word problems for testing multi-step mathematical reasoning [(Cobbe et al., 2021)](https://arxiv.org/pdf/2110.14168.pdf). | generation | 1000.000 | 1.000 | 347740.068 | 84180.954 | 46.000 |
MATH | The MATH benchmark for measuring mathematical problem solving on competition math problems [(Hendrycks et al., 2021)](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/be83ab3ecd0db773eb2dc1b0a17836a1-Abstract-round2.html). | generation | 62.429 | 1.000 | 1128332.721 | 14327.668 | 46.000 |
MATH (chain-of-thoughts) | The MATH benchmark for measuring mathematical problem solving on competition math problems with chain-of-thought-style reasoning [(Hendrycks et al., 2021)](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/be83ab3ecd0db773eb2dc1b0a17836a1-Abstract-round2.html). | generation | 62.429 | 1.000 | 3384124.701 | 467073.211 | 46.000 |
HumanEval (Code) | The HumanEval benchmark for measuring functional correctness for synthesizing programs from docstrings [(Chen et al., 2021)](https://arxiv.org/pdf/2107.03374.pdf). | generation | 164.000 | 1.000 | 1022.085 | 490.756 | 2.000 |
LegalSupport | Scenario introduced in this work to measure fine-grained legal reasoning through reverse entailment. | multiple_choice_joint, multiple_choice_separate_original, multiple_choice_separate_calibrated | 500.000 | 2.000 | 471144.789 | 816.384 | 46.000 |
LSAT | The LSAT benchmark for measuring analytical reasoning on the Law School Admission Test (LSAT; [Zhong et al., 2021](https://arxiv.org/pdf/2104.06598.pdf)). | multiple_choice_joint, multiple_choice_separate_original, multiple_choice_separate_calibrated | 230.500 | 5.000 | 902591.181 | 815.961 | 46.000 |
Data imputation | Scenario from [Mei et al. (2021)](https://ieeexplore.ieee.org/document/9458712/) that tests the ability to impute missing entities in a data table. | generation | 106.000 | 1.000 | 482790.642 | 4479.659 | 44.000 |
Entity matching | Scenario from Magellan [(Konda et al., 2016)](https://dl.acm.org/doi/10.14778/3007263.3007314) that tests the ability to determine if two entities match. | generation | 233.333 | 1.000 | 2033910.145 | 4788.153 | 44.000 |
Copyright (text) | Scenario introduced in this work to measure copyright and memorization behavior for books, based on [Carlini et al. (2021)](https://www.usenix.org/biblio-11958). | generation | 429.638 | 1.000 | 56724.531 | 1227051.309 | 44.000 |
Copyright (code) | Scenario introduced in this work to measure copyright and memorization behavior for code, based on [Carlini et al. (2021)](https://www.usenix.org/biblio-11958). | generation | 1000.000 | 1.000 | 962.322 | 12148.506 | 2.000 |
Disinformation (reiteration) | Scenario from [Buchanan et al. (2021)](https://cset.georgetown.edu/publication/truth-lies-and-automation/) that tests the ability to reiterate disinformation content. | generation | 34.000 | 1.000 | 197902.758 | 94036.431 | 44.000 |
Disinformation (wedging) | Scenario from [Buchanan et al. (2021)](https://cset.georgetown.edu/publication/truth-lies-and-automation/) that tests the ability to generate divisive and wedging content. | generation | 2.750 | 0.000 | 81822.000 | 135649.500 | 44.000 |
BBQ (Bias Benchmark for Question Answering) | The Bias Benchmark for Question Answering (BBQ) for measuring social bias in question answering in ambiguous and unambiguous contexts [(Parrish et al., 2022)](https://aclanthology.org/2022.findings-acl.165/). | multiple_choice_joint, multiple_choice_separate_original, multiple_choice_separate_calibrated | 1000.000 | 3.000 | 158293.821 | 389.634 | 44.000 |
BOLD (Bias in Open-Ended Language Generation Dataset) | The Bias in Open-Ended Language Generation Dataset (BOLD) for measuring biases and toxicity in open-ended language generation [(Dhamala et al., 2021)](https://dl.acm.org/doi/10.1145/3442188.3445924). | generation | 1000.000 | 0.000 | 1466.703 | 2603.415 | 44.000 |
RealToxicityPrompts | The RealToxicityPrompts dataset for measuring toxicity in prompted model generations [(Gehman et al., 2020)](https://aclanthology.org/2020.findings-emnlp.301/). | generation | 500.000 | 0.000 | 3948.524 | 25514.793 | 44.000 |
Synthetic efficiency | Scenario introduced in this work to better understand inference runtime performance of various models. | generation | 10.000 | 1.000 | 2651124.000 | 73642.500 | 39.000 |
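The rows above follow a fixed pipe-delimited layout: scenario name, description with citation, adaptation method(s), and a series of numeric columns whose header row is not reproduced in this excerpt. As a minimal sketch (the helper names `parse_row` and `scenarios_per_method` are hypothetical, and no meaning is assigned to the unnamed numeric columns), the rows could be parsed programmatically like this, e.g. to count how many scenarios use each adaptation method:

```python
# Hypothetical parsing sketch for the pipe-delimited scenario rows above;
# not part of any benchmark's official tooling.
from collections import defaultdict

def parse_row(line: str) -> dict:
    """Split one row of the form: name | description | methods | numeric columns."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    name, description, methods = cells[0], cells[1], cells[2]
    return {
        "name": name,
        "description": description,
        "methods": [m.strip() for m in methods.split(",")],
        # Numeric columns kept in printed order; their headers are not shown
        # in this excerpt, so no column names are assumed here.
        "stats": [float(cell) for cell in cells[3:]],
    }

def scenarios_per_method(lines) -> dict:
    """Count how many scenarios use each adaptation method."""
    counts = defaultdict(int)
    for line in lines:
        if line.strip():
            for method in parse_row(line)["methods"]:
                counts[method] += 1
    return dict(counts)

# Example usage with one row copied from the table above:
row = ("BoolQ | The BoolQ benchmark for binary (yes/no) question answering "
       "[(Clark et al., 2019)](https://aclanthology.org/N19-1300/). "
       "| generation | 1000.000 | 1.000 | 1672542.565 | 4038.667 | 44.000 |")
print(scenarios_per_method([row]))  # {'generation': 1}
```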