Knowledge
Targeted evaluation of knowledge (e.g. factual, cultural, commonsense).
Accuracy
Mean win rate
NaturalQuestions (closed-book) - F1
HellaSwag - EM
OpenbookQA - EM
TruthfulQA - EM
MMLU - EM
WikiFact - EM
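The "Mean win rate" column is not defined in this listing. The following is a minimal sketch, assuming the usual head-to-head reading: for each scenario, take the fraction of other models that a given model outscores, then average those fractions across scenarios. The function name `mean_win_rate`, the model names, the toy scores, and the choice to count ties as losses are all illustrative assumptions, not values or rules taken from the leaderboard.

```python
from typing import Dict


def mean_win_rate(
    scores: Dict[str, Dict[str, float]],
    higher_is_better: bool = True,
) -> Dict[str, float]:
    """Average, over scenarios, of the fraction of rival models outscored.

    scores[model][scenario] holds one metric value (e.g. EM or F1).
    Ties count as losses here; that is an assumption of this sketch.
    """
    models = list(scores)
    result: Dict[str, float] = {}
    for model in models:
        per_scenario = []
        for scenario, value in scores[model].items():
            # Only compare against rivals that also report this scenario.
            rivals = [
                scores[other][scenario]
                for other in models
                if other != model and scenario in scores[other]
            ]
            if not rivals:
                continue
            if higher_is_better:
                wins = sum(value > rival for rival in rivals)
            else:
                wins = sum(value < rival for rival in rivals)
            per_scenario.append(wins / len(rivals))
        result[model] = (
            sum(per_scenario) / len(per_scenario) if per_scenario else float("nan")
        )
    return result


# Made-up scores, for illustration only.
toy_scores = {
    "model_a": {"MMLU": 0.60, "HellaSwag": 0.80},
    "model_b": {"MMLU": 0.50, "HellaSwag": 0.85},
    "model_c": {"MMLU": 0.40, "HellaSwag": 0.70},
}
rates = mean_win_rate(toy_scores)
```

For metrics where lower is better (ECE, inference time), the same sketch applies with `higher_is_better=False`.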
Calibration
Mean win rate
NaturalQuestions (closed-book) - ECE (10-bin)
HellaSwag - ECE (10-bin)
OpenbookQA - ECE (10-bin)
TruthfulQA - ECE (10-bin)
MMLU - ECE (10-bin)
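As a rough illustration of what an "ECE (10-bin)" column measures, the sketch below computes expected calibration error: predictions are grouped into 10 confidence bins, and the gaps between each bin's mean confidence and its accuracy are averaged, weighted by bin size. Equal-width bins are an assumption of this sketch; the leaderboard may bin differently (for example, with equal-mass bins). The function name and the sample inputs are hypothetical.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """ECE with equal-width confidence bins over [0, 1] (assumed binning)."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("at least one prediction is required")

    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's |confidence - accuracy| gap by its share of examples.
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Made-up predictions: two confident and correct, two uncertain with one wrong.
ece = expected_calibration_error(
    [0.95, 0.95, 0.55, 0.55],
    [True, True, True, False],
)
```

A perfectly calibrated model would score 0; larger values mean stated confidence and observed accuracy diverge.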
Robustness
Mean win rate
NaturalQuestions (closed-book) - F1 (Robustness)
HellaSwag - EM (Robustness)
OpenbookQA - EM (Robustness)
TruthfulQA - EM (Robustness)
MMLU - EM (Robustness)
Fairness
Mean win rate
NaturalQuestions (closed-book) - F1 (Fairness)
HellaSwag - EM (Fairness)
OpenbookQA - EM (Fairness)
TruthfulQA - EM (Fairness)
MMLU - EM (Fairness)
Bias
Mean win rate
NaturalQuestions (closed-book) - Stereotypes (race)
NaturalQuestions (closed-book) - Stereotypes (gender)
NaturalQuestions (closed-book) - Representation (race)
NaturalQuestions (closed-book) - Representation (gender)
Toxicity
Mean win rate
NaturalQuestions (closed-book) - Toxic fraction
Efficiency
Mean win rate
NaturalQuestions (closed-book) - Denoised inference time (s)
HellaSwag - Denoised inference time (s)
OpenbookQA - Denoised inference time (s)
TruthfulQA - Denoised inference time (s)
MMLU - Denoised inference time (s)
WikiFact - Denoised inference time (s)
General information
Mean win rate
NaturalQuestions (closed-book) - # eval
NaturalQuestions (closed-book) - # train
NaturalQuestions (closed-book) - truncated
NaturalQuestions (closed-book) - # prompt tokens
NaturalQuestions (closed-book) - # output tokens
NaturalQuestions (closed-book) - # trials
HellaSwag - # eval
HellaSwag - # train
HellaSwag - truncated
HellaSwag - # prompt tokens
HellaSwag - # output tokens
HellaSwag - # trials
OpenbookQA - # eval
OpenbookQA - # train
OpenbookQA - truncated
OpenbookQA - # prompt tokens
OpenbookQA - # output tokens
OpenbookQA - # trials
TruthfulQA - # eval
TruthfulQA - # train
TruthfulQA - truncated
TruthfulQA - # prompt tokens
TruthfulQA - # output tokens
TruthfulQA - # trials
MMLU - # eval
MMLU - # train
MMLU - truncated
MMLU - # prompt tokens
MMLU - # output tokens
MMLU - # trials
WikiFact - # eval
WikiFact - # train
WikiFact - truncated
WikiFact - # prompt tokens
WikiFact - # output tokens
WikiFact - # trials