Question answering
In question answering (QA), a model is given a question and, in open-book settings, an accompanying passage, and must produce the answer. QA is a general format that covers a wide range of tasks requiring varying levels of world knowledge, commonsense knowledge, and reasoning ability.
Mean win rate
MMLU - EM
BoolQ - EM
NarrativeQA - F1
NaturalQuestions (closed-book) - F1
NaturalQuestions (open-book) - F1
QuAC - F1
HellaSwag - EM
OpenbookQA - EM
TruthfulQA - EM
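The EM (exact match) and F1 columns above can be illustrated with a minimal sketch. It assumes the common SQuAD-style convention of normalizing answers (lowercasing, stripping punctuation and English articles) and scoring F1 over whitespace tokens; the benchmark's actual normalization and its handling of multiple references may differ, and the function names here are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization, not necessarily the benchmark's own.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and one reference answer."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

EM suits datasets with a small closed answer set (multiple choice, yes/no), while F1 gives partial credit for free-form answers that overlap with the reference.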
Mean win rate
MMLU - ECE (10-bin)
BoolQ - ECE (10-bin)
NarrativeQA - ECE (10-bin)
NaturalQuestions (closed-book) - ECE (10-bin)
NaturalQuestions (open-book) - ECE (10-bin)
QuAC - ECE (10-bin)
HellaSwag - ECE (10-bin)
OpenbookQA - ECE (10-bin)
TruthfulQA - ECE (10-bin)
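As a rough illustration of the ECE (10-bin) columns above: expected calibration error groups predictions into bins by confidence and takes the weighted average gap between mean confidence and accuracy in each bin. The sketch below assumes equal-width bins over [0, 1]; the benchmark may use a different scheme (for example, equal-mass bins), and the function name is hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin size / n) * |mean confidence - accuracy|.

    Assumption: equal-width bins over [0, 1]; other binning schemes exist.
    confidences: probabilities the model assigned to its chosen answers.
    correct: matching booleans (or 0/1) saying whether each answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, float(hit)))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

A perfectly calibrated model scores 0; a model that reports 90% confidence but is right only 75% of the time in that bin contributes a gap of 0.15 for that bin.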
Mean win rate
MMLU - EM (Robustness)
BoolQ - EM (Robustness)
NarrativeQA - F1 (Robustness)
NaturalQuestions (closed-book) - F1 (Robustness)
NaturalQuestions (open-book) - F1 (Robustness)
QuAC - F1 (Robustness)
HellaSwag - EM (Robustness)
OpenbookQA - EM (Robustness)
TruthfulQA - EM (Robustness)
Mean win rate
MMLU - EM (Fairness)
BoolQ - EM (Fairness)
NarrativeQA - F1 (Fairness)
NaturalQuestions (closed-book) - F1 (Fairness)
NaturalQuestions (open-book) - F1 (Fairness)
QuAC - F1 (Fairness)
HellaSwag - EM (Fairness)
OpenbookQA - EM (Fairness)
TruthfulQA - EM (Fairness)
Mean win rate
MMLU - Denoised inference time (s)
BoolQ - Denoised inference time (s)
NarrativeQA - Denoised inference time (s)
NaturalQuestions (closed-book) - Denoised inference time (s)
NaturalQuestions (open-book) - Denoised inference time (s)
QuAC - Denoised inference time (s)
HellaSwag - Denoised inference time (s)
OpenbookQA - Denoised inference time (s)
TruthfulQA - Denoised inference time (s)
Mean win rate
MMLU - # eval
MMLU - # train
MMLU - truncated
MMLU - # prompt tokens
MMLU - # output tokens
MMLU - # trials
BoolQ - # eval
BoolQ - # train
BoolQ - truncated
BoolQ - # prompt tokens
BoolQ - # output tokens
BoolQ - # trials
NarrativeQA - # eval
NarrativeQA - # train
NarrativeQA - truncated
NarrativeQA - # prompt tokens
NarrativeQA - # output tokens
NarrativeQA - # trials
NaturalQuestions (closed-book) - # eval
NaturalQuestions (closed-book) - # train
NaturalQuestions (closed-book) - truncated
NaturalQuestions (closed-book) - # prompt tokens
NaturalQuestions (closed-book) - # output tokens
NaturalQuestions (closed-book) - # trials
NaturalQuestions (open-book) - # eval
NaturalQuestions (open-book) - # train
NaturalQuestions (open-book) - truncated
NaturalQuestions (open-book) - # prompt tokens
NaturalQuestions (open-book) - # output tokens
NaturalQuestions (open-book) - # trials
QuAC - # eval
QuAC - # train
QuAC - truncated
QuAC - # prompt tokens
QuAC - # output tokens
QuAC - # trials
HellaSwag - # eval
HellaSwag - # train
HellaSwag - truncated
HellaSwag - # prompt tokens
HellaSwag - # output tokens
HellaSwag - # trials
OpenbookQA - # eval
OpenbookQA - # train
OpenbookQA - truncated
OpenbookQA - # prompt tokens
OpenbookQA - # output tokens
OpenbookQA - # trials
TruthfulQA - # eval
TruthfulQA - # train
TruthfulQA - truncated
TruthfulQA - # prompt tokens
TruthfulQA - # output tokens
TruthfulQA - # trials
Mean win rate
BoolQ - Stereotypes (race)
BoolQ - Stereotypes (gender)
BoolQ - Representation (race)
BoolQ - Representation (gender)
NarrativeQA - Stereotypes (race)
NarrativeQA - Stereotypes (gender)
NarrativeQA - Representation (race)
NarrativeQA - Representation (gender)
NaturalQuestions (closed-book) - Stereotypes (race)
NaturalQuestions (closed-book) - Stereotypes (gender)
NaturalQuestions (closed-book) - Representation (race)
NaturalQuestions (closed-book) - Representation (gender)
NaturalQuestions (open-book) - Stereotypes (race)
NaturalQuestions (open-book) - Stereotypes (gender)
NaturalQuestions (open-book) - Representation (race)
NaturalQuestions (open-book) - Representation (gender)
QuAC - Stereotypes (race)
QuAC - Stereotypes (gender)
QuAC - Representation (race)
QuAC - Representation (gender)
Mean win rate
BoolQ - Toxic fraction
NarrativeQA - Toxic fraction
NaturalQuestions (closed-book) - Toxic fraction
NaturalQuestions (open-book) - Toxic fraction
QuAC - Toxic fraction