Core scenarios
The core scenarios, on which every model is evaluated. Results are organized into the metric groups below; each group leads with a mean win rate aggregated across scenarios, followed by its per-scenario columns.

Accuracy
Mean win rate, plus one accuracy metric per scenario (a sketch of the win-rate computation follows the list):
MMLU - EM
BoolQ - EM
NarrativeQA - F1
NaturalQuestions (closed-book) - F1
NaturalQuestions (open-book) - F1
QuAC - F1
HellaSwag - EM
OpenbookQA - EM
TruthfulQA - EM
MS MARCO (regular) - RR@10
MS MARCO (TREC) - NDCG@10
CNN/DailyMail - ROUGE-2
XSUM - ROUGE-2
IMDB - EM
CivilComments - EM
RAFT - EM
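
Mean win rate is the headline aggregate in every group on this page. Below is a minimal sketch of one common definition, assuming per-model, per-scenario scores have already been collected: for each scenario, a model's win rate is the fraction of other models it beats on that scenario's metric, and the mean win rate averages this over scenarios. The `scores` layout and tie handling are illustrative assumptions.

```python
# Sketch: mean win rate from a {model: {scenario: score}} table.
def mean_win_rate(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    models = list(scores)
    scenarios = {s for per_model in scores.values() for s in per_model}
    result = {}
    for m in models:
        win_rates = []
        for scenario in scenarios:
            rivals = [o for o in models if o != m and scenario in scores[o]]
            if scenario not in scores[m] or not rivals:
                continue
            wins = sum(scores[m][scenario] > scores[o][scenario] for o in rivals)
            win_rates.append(wins / len(rivals))  # beaten rivals / all rivals
        result[m] = sum(win_rates) / len(win_rates) if win_rates else float("nan")
    return result
```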

Calibration
Mean win rate, plus ECE (10-bin) for each of:
MMLU, BoolQ, NarrativeQA, NaturalQuestions (closed-book), NaturalQuestions (open-book), QuAC, HellaSwag, OpenbookQA, TruthfulQA, IMDB, CivilComments, RAFT
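
ECE (10-bin) measures how far confidence is from accuracy. A minimal sketch, assuming `confidences` holds the model's probability for its chosen answer and `correct` flags whether that answer was right: instances fall into ten equal-width confidence bins, and ECE is the instance-weighted average gap between each bin's mean confidence and its accuracy.

```python
# Sketch: 10-bin expected calibration error.
def ece_10_bin(confidences: list[float], correct: list[bool]) -> float:
    n, bins = len(confidences), 10
    total = 0.0
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        # Last bin is closed on the right so confidence 1.0 is counted.
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == bins - 1 and c == 1.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        total += len(idx) / n * abs(avg_conf - accuracy)
    return total
```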

Robustness
Mean win rate, plus each scenario's accuracy metric recomputed under input perturbations. The scenario-metric pairs match the Accuracy list, minus the two summarization scenarios (CNN/DailyMail and XSUM); a worst-case scoring sketch follows.
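
A hedged sketch of perturbation-based robustness: each instance is evaluated on its original input and on perturbed variants (typos, case changes, and the like), and the instance is scored by its worst case. The variant set and the min/mean aggregation here are illustrative assumptions, not the benchmark's exact recipe.

```python
# Sketch: worst-case score per instance, averaged over the dataset.
# `instances` yields (original_input, perturbed_variants, reference);
# `model` maps input text to a prediction; `metric` scores it (EM, F1, ...).
def robustness_score(instances, model, metric) -> float:
    per_instance = []
    for original, variants, reference in instances:
        worst = min(metric(model(text), reference)
                    for text in [original, *variants])
        per_instance.append(worst)
    return sum(per_instance) / len(per_instance)
```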

Fairness
Mean win rate, plus each scenario's accuracy metric recomputed on inputs rewritten with fairness-oriented perturbations. The fourteen scenario-metric pairs are the same as under Robustness; a toy perturbation sketch follows.
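
The fairness columns reuse the accuracy metrics on demographically perturbed inputs. As a toy illustration of one such perturbation (the word list and the blunt swap rule are placeholders; real substitutions must handle ambiguous forms like possessive "her"):

```python
import re

# Toy, non-invertible gender-term swap; illustrative only.
GENDER_SWAP = {"he": "she", "him": "her", "his": "her",
               "she": "he", "her": "him", "hers": "his"}

def swap_gender_terms(text: str) -> str:
    def repl(match: re.Match) -> str:
        word = match.group(0)
        swapped = GENDER_SWAP.get(word.lower(), word)
        return swapped.capitalize() if word[0].isupper() else swapped
    return re.sub(r"\b\w+\b", repl, text)

print(swap_gender_terms("He gave her his book."))  # -> She gave him her book.
```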

Efficiency
Mean win rate, plus Denoised inference time (s) for every scenario in the Accuracy list.
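
"Denoised" inference time aims to report runtime with measurement noise stripped out. A loose sketch of one heuristic, assuming the same request can be replayed several times: a low quantile over repeated wall-clock timings discards transient queueing and contention spikes. The benchmark's actual denoised estimate may well be modeled differently (e.g., calibrated to idealized hardware), so treat this as an assumption, not the published method.

```python
import statistics
import time

# Sketch: low-quantile latency over repeated trials of one request.
# `run_request` is an assumed callable that issues the model call.
def denoised_latency(run_request, trials: int = 5) -> float:
    timings = []
    for _ in range(trials):
        start = time.perf_counter()
        run_request()
        timings.append(time.perf_counter() - start)
    # 10th percentile ~ near-best-case time with transient noise removed.
    return statistics.quantiles(timings, n=10)[0]
```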

General information
Mean win rate, plus six run statistics for every scenario in the Accuracy list: # eval, # train, truncated, # prompt tokens, # output tokens, # trials.
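
These run statistics can be derived from a request-level log. A minimal sketch in which the record fields (`scenario`, `split`, `truncated`, `prompt_tokens`, `output_tokens`, `trial`) are assumed names for illustration; token counts are accumulated as totals and can be divided by the instance counts if per-instance averages are wanted.

```python
from collections import defaultdict

# Sketch: aggregate per-scenario run statistics from request records.
def run_stats(records: list[dict]) -> dict[str, dict]:
    stats = defaultdict(lambda: {"# eval": 0, "# train": 0, "truncated": 0,
                                 "# prompt tokens": 0, "# output tokens": 0,
                                 "# trials": 0})
    for r in records:
        s = stats[r["scenario"]]
        s["# eval" if r["split"] == "eval" else "# train"] += 1
        s["truncated"] += int(r["truncated"])
        s["# prompt tokens"] += r["prompt_tokens"]
        s["# output tokens"] += r["output_tokens"]
        s["# trials"] = max(s["# trials"], r["trial"] + 1)  # trials 0-indexed
    return {k: dict(v) for k, v in stats.items()}
```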

Bias
Mean win rate, plus four bias metrics per scenario: Stereotypes (race), Stereotypes (gender), Representation (race), Representation (gender). Measured on twelve scenarios:
BoolQ, NarrativeQA, NaturalQuestions (closed-book), NaturalQuestions (open-book), QuAC, MS MARCO (regular), MS MARCO (TREC), CNN/DailyMail, XSUM, IMDB, CivilComments, RAFT
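
A hedged sketch of a representation-style metric: count group-associated terms in the model's generations and measure how far the empirical distribution sits from uniform (total variation distance). The term lists are toy placeholders, and stereotype metrics, which look at which words co-occur with which groups, are beyond this sketch.

```python
# Toy term lists; real metrics use curated vocabularies per group.
GROUP_TERMS = {"female": {"she", "her", "woman", "women"},
               "male": {"he", "him", "man", "men"}}

def representation_bias(generations: list[str]) -> float:
    counts = {group: 0 for group in GROUP_TERMS}
    for text in generations:
        for token in text.lower().split():
            for group, terms in GROUP_TERMS.items():
                if token.strip(".,!?;:") in terms:
                    counts[group] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no group mentions at all
    uniform = 1 / len(counts)
    # Total variation distance from the uniform distribution over groups.
    return 0.5 * sum(abs(c / total - uniform) for c in counts.values())
```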

Toxicity
Mean win rate, plus Toxic fraction for the same twelve scenarios as Bias.
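
Toxic fraction is simply the share of generations flagged toxic by some classifier. A minimal sketch, with the classifier left abstract (`is_toxic` is an assumed callable, e.g., a wrapper around a toxicity API or a local model):

```python
from typing import Callable

# Sketch: fraction of generations the classifier flags as toxic.
def toxic_fraction(generations: list[str],
                   is_toxic: Callable[[str], bool]) -> float:
    if not generations:
        return 0.0
    return sum(is_toxic(g) for g in generations) / len(generations)
```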

Summarization metrics
Mean win rate, plus nine metrics for each of CNN/DailyMail and XSUM: SummaC, QAFactEval, BERTScore (F1), Coverage, Density, Compression, HumanEval-faithfulness, HumanEval-relevance, HumanEval-coherence.
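
SummaC, QAFactEval, and BERTScore each need an external model, but Coverage, Density, and Compression are plain statistics over the extractive fragments shared between article and summary, in the style of the Newsroom definitions (Grusky et al., 2018). A sketch on whitespace tokens, assuming non-empty texts:

```python
# Greedily take, at each summary position, the longest fragment that
# also appears in the article; skip a token when nothing matches.
def extractive_fragments(article: list[str], summary: list[str]) -> list[list[str]]:
    fragments, i = [], 0
    while i < len(summary):
        best = 0
        for j in range(len(article)):
            k = 0
            while (i + k < len(summary) and j + k < len(article)
                   and summary[i + k] == article[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary[i:i + best])
            i += best
        else:
            i += 1
    return fragments

def coverage_density_compression(article_text: str, summary_text: str):
    article, summary = article_text.split(), summary_text.split()
    frags = extractive_fragments(article, summary)
    n = len(summary)
    coverage = sum(len(f) for f in frags) / n      # copied-token share
    density = sum(len(f) ** 2 for f in frags) / n  # avg squared fragment length
    compression = len(article) / n                 # length ratio
    return coverage, density, compression
```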