A language model takes in text and produces text.
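
Concretely, the interface is just a function from an input string (a prompt) to an output string (a completion). The sketch below illustrates that text-in, text-out contract; the `LanguageModel` protocol and `complete` method are hypothetical placeholders for illustration, not HELM's or any provider's actual API.

```python
from typing import Protocol

class LanguageModel(Protocol):
    """Hypothetical text-in, text-out interface; any concrete model or API can fill it."""
    def complete(self, prompt: str, max_tokens: int = 64) -> str: ...

def answer_question(model: LanguageModel, question: str) -> str:
    # Only the prompt changes across tasks (question answering, summarization, ...);
    # the interface stays the same.
    return model.complete(f"Q: {question}\nA:")
```
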
Despite their simplicity, language models are increasingly functioning as the foundation for almost all language technologies, from question answering to summarization. But their immense capabilities and risks are not well understood. Holistic Evaluation of Language Models (HELM) is a living benchmark that aims to improve the transparency of language models. It is built on three elements:
Broad coverage and recognition of incompleteness. We define a taxonomy over the scenarios we would ideally like to evaluate, select scenarios and metrics to cover the space, and make explicit what is missing.
Multi-metric measurement. Rather than focus on isolated metrics such as accuracy, we simultaneously measure multiple metrics (e.g., accuracy, robustness, calibration, efficiency) for each scenario, allowing analysis of tradeoffs.
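
As a rough sketch of what this means in practice, each scenario produces a vector of metric values rather than a single accuracy number. The code below is a simplified stand-in; the `predict` callable and the metric computations are assumptions for illustration, not HELM's implementation.

```python
# Illustrative sketch only: one scenario, several metrics measured together.
# The metric names echo HELM's categories; the computations are simplified stand-ins.
def evaluate_scenario(examples, predict):
    n = len(examples)
    totals = {"accuracy": 0.0, "calibration_gap": 0.0, "runtime_s": 0.0}
    for ex in examples:
        # `predict` is a hypothetical callable returning (answer, confidence, seconds).
        answer, confidence, seconds = predict(ex["input"])
        correct = float(answer == ex["target"])
        totals["accuracy"] += correct / n
        totals["calibration_gap"] += (confidence - correct) / n
        totals["runtime_s"] += seconds / n
    totals["calibration_gap"] = abs(totals["calibration_gap"])
    return totals
```
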
Standardization. We evaluate all the models that we have access to on the same scenarios with the same adaptation strategy (e.g., prompting), allowing for controlled comparisons. Thanks to all the companies for providing API access to the limited-access and closed models, and to Together for providing the infrastructure to run the open models.
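
To make prompting-based adaptation concrete, here is a minimal sketch of a few-shot prompt template applied identically to every model; the template wording and helper names are illustrative, not HELM's exact format.

```python
# Sketch of prompting-based adaptation: the same few-shot template is applied
# to every model, so comparisons are controlled.
def build_prompt(instructions, in_context_examples, test_input):
    parts = [instructions]
    for ex in in_context_examples:              # the "few-shot" demonstrations
        parts.append(f"Input: {ex['input']}\nOutput: {ex['output']}")
    parts.append(f"Input: {test_input}\nOutput:")  # the model completes this line
    return "\n\n".join(parts)

prompt = build_prompt(
    "Classify the sentiment of the movie review as Positive or Negative.",
    [{"input": "A delightful film.", "output": "Positive"},
     {"input": "Two hours I will never get back.", "output": "Negative"}],
    "The plot was thin but the acting was superb.",
)
```

Because every model receives the same prompt and in-context examples, differences in the resulting metrics can be attributed to the models themselves.
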
- AI21 Labs / J1-Jumbo v1 (178B)
- AI21 Labs / J1-Large v1 (7.5B)
- AI21 Labs / J1-Grande v1 (17B)
- AI21 Labs / J1-Grande v2 beta (17B)
- AI21 Labs / Jurassic-2 Jumbo (178B)
- AI21 Labs / Jurassic-2 Grande (17B)
- AI21 Labs / Jurassic-2 Large (7.5B)
- Aleph Alpha / Luminous Base (13B)
- Aleph Alpha / Luminous Extended (30B)
- Aleph Alpha / Luminous Supreme (70B)
- Anthropic / Anthropic-LM v4-s3 (52B)
- Anthropic / Anthropic Claude v1.3
- Anthropic / Anthropic Claude Instant v1
- UC Berkeley / Koala (13B)
- BigScience / BLOOM (176B)
- BigScience / BLOOMZ (176B)
- BigScience / T0pp (11B)
- BigCode / SantaCoder (1.1B)
- BigCode / StarCoder (15.5B)
- Cerebras / Cerebras GPT (6.7B)
- Cerebras / Cerebras GPT (13B)
- Cohere / Cohere xlarge v20220609 (52.4B)
- Cohere / Cohere large v20220720 (13.1B)
- Cohere / Cohere medium v20220720 (6.1B)
- Cohere / Cohere small v20220720 (410M)
- Cohere / Cohere xlarge v20221108 (52.4B)
- Cohere / Cohere medium v20221108 (6.1B)
- Cohere / Cohere Command beta (6.1B)
- Cohere / Cohere Command beta (52.4B)
- Databricks / Dolly V2 (3B)
- Databricks / Dolly V2 (7B)
- Databricks / Dolly V2 (12B)
- DeepMind / Gopher (280B)
- DeepMind / Chinchilla (70B)
- EleutherAI / GPT-J (6B)
- EleutherAI / GPT-NeoX (20B)
- EleutherAI / Pythia (3B)
- EleutherAI / Pythia (7B)
- EleutherAI / Pythia (12B)
- Google / T5 (11B)
- Google / UL2 (20B)
- Google / Flan-T5 (11B)
- Google / PaLM (540B)
- HazyResearch / H3 (2.7B)
- Meta / OPT-IML (175B)
- Meta / OPT-IML (30B)
- Meta / OPT (175B)
- Meta / OPT (66B)
- Meta / OPT (6.7B)
- Meta / OPT (1.3B)
- Meta / Galactica (120B)
- Meta / Galactica (30B)
- Meta / LLaMA (7B)
- Meta / LLaMA (13B)
- Meta / LLaMA (30B)
- Meta / LLaMA (65B)
- Stability AI / StableLM-Base-Alpha (7B)
- Stanford / Alpaca (7B)
- Stanford / Alpaca (13B)
- Stanford / Alpaca (30B)
- LMSYS / Vicuna (13B)
- Microsoft/NVIDIA / TNLG v2 (530B)
- Microsoft/NVIDIA / TNLG v2 (6.7B)
- OpenAI / davinci (175B)
- OpenAI / curie (6.7B)
- OpenAI / babbage (1.3B)
- OpenAI / ada (350M)
- OpenAI / text-davinci-003
- OpenAI / text-davinci-002
- OpenAI / text-davinci-001
- OpenAI / text-curie-001
- OpenAI / text-babbage-001
- OpenAI / text-ada-001
- OpenAI / gpt-4-0314
- OpenAI / gpt-4-32k-0314
- OpenAI / code-davinci-002
- OpenAI / code-davinci-001
- OpenAI / code-cushman-001 (12B)
- OpenAI / gpt-3.5-turbo-0301
- OpenAI / gpt-3.5-turbo-0613
- OpenAI / ChatGPT
- Together / GPT-JT (6B)
- Together / GPT-NeoXT-Chat-Base (20B)
- Together / RedPajama-INCITE-Base-v1 (3B)
- Together / RedPajama-INCITE-Instruct-v1 (3B)
- Together / RedPajama-INCITE-Chat-v1 (3B)
- Together / RedPajama-INCITE-Base-v1 (7B)
- MosaicML / MPT (7B)
- MosaicML / MPT-Chat (7B)
- MosaicML / MPT-Instruct (7B)
- Salesforce / CodeGen (16B)
- Tsinghua / GLM (130B)
- Tsinghua / CodeGeeX (13B)
- Writer / Palmyra Base (5B)
- Writer / Palmyra Large (20B)
- Writer / InstructPalmyra (30B)
- Writer / Palmyra E (30B)
- Writer / Silk Road (35B)
- Writer / Palmyra X (43B)
- Yandex / YaLM (100B)
- NVIDIA / Megatron GPT2
101 Models

Information Retrieval
- MS MARCO (regular track)
- MS MARCO (TREC track)

Summarization
- CNN/DailyMail
- XSUM
- BillSum
- MultiLexSum
- EurLexSum

Sentiment Analysis
- IMDB

Text Classification
- RAFT (Real-world Annotated Few-Shot)

Toxicity Classification
- CivilComments

Language Modeling
- ICE (International Corpus of English)
- The Pile
- TwitterAAE
- TwitterAAE (AA)
- TwitterAAE (white)

Knowledge Base Completion
- WikiFact

Next-word Prediction
- Dyck
- Numerical reasoning

- Synthetic reasoning (abstract symbols)
- Synthetic reasoning (natural language)
- GSM8K (Grade school math word problems)
- MATH
- MATH (chain-of-thought)
- APPS (Code)
- HumanEval (Code)
- LegalSupport
- LSAT
- Data imputation
- Entity matching
- Copyright (text)
- Copyright (code)
- Disinformation (reiteration)
- Disinformation (wedging)
- BBQ (Bias Benchmark for Question Answering)
- BOLD (Bias in Open-Ended Language Generation Dataset)
- RealToxicityPrompts
- Synthetic efficiency
48 Scenarios

Calibration (calibration)
- 10-bin expected calibration error

Calibration (calibration_detailed)
- 1-bin expected calibration error
- Max prob
- 10-bin expected calibration error
- 1-bin expected calibration error (after Platt scaling)
- 10-bin expected calibration error (after Platt scaling)
- Platt Scaling Coefficient
- Platt Scaling Intercept
- Selective coverage-accuracy area
- Accuracy at 10% coverage
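
For reference, an n-bin expected calibration error buckets predictions by confidence and averages the gap between each bucket's mean confidence and its accuracy. The sketch below is a generic implementation of that idea, not HELM's code; with num_bins=1 it reduces to the 1-bin variant listed above.

```python
# Generic sketch of n-bin expected calibration error (ECE); not HELM's implementation.
def expected_calibration_error(confidences, correct, num_bins=10):
    """confidences: floats in [0, 1]; correct: 0/1 outcomes of the same length."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)  # assign to a confidence bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for bucket in bins:
        if bucket:
            avg_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(o for _, o in bucket) / len(bucket)
            ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

# Example: three predictions with confidences and correctness indicators (~0.30).
print(expected_calibration_error([0.9, 0.6, 0.8], [1, 0, 1]))
```
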

Bias (bias)
- Stereotypical associations (race, profession)
- Stereotypical associations (gender, profession)
- Demographic representation (race)
- Demographic representation (gender)

Toxicity (toxicity)
- Toxic fraction

Efficiency (efficiency)
- Denoised inference runtime (s)

Efficiency (efficiency_detailed)
- Estimated training emissions (kg CO2)
- Estimated training energy cost (MWh)
- Observed inference runtime (s)
- Idealized inference runtime (s)
- Denoised inference runtime (s)

General information (general_information)
- # trials
- # prompt tokens
- # output tokens
- # eval
- # train
- truncated

Summarization metrics (summarization_metrics)
- SummaC
- QAFactEval
- Coverage
- Density
- Compression
- BERTScore (F1)
- HumanEval-faithfulness
- HumanEval-relevance
- HumanEval-coherence

APPS metrics (apps_metrics)
- Avg. # tests passed
- Strict correctness

BBQ metrics (bbq_metrics)
- BBQ (ambiguous)
- BBQ (unambiguous)

Copyright metrics (copyright_metrics)
- Longest common prefix length
- Edit distance (Levenshtein)
- Edit similarity (Levenshtein)
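
These copyright metrics quantify verbatim overlap between a model's output and reference text. The sketch below gives generic, character-level versions of the longest-common-prefix and Levenshtein-based similarity measures; it is an illustration under those assumptions, not HELM's implementation (which may operate on tokens).

```python
# Generic sketches of the copyright overlap measures; not HELM's exact code.
def longest_common_prefix_length(generated: str, reference: str) -> int:
    n = 0
    for a, b in zip(generated, reference):
        if a != b:
            break
        n += 1
    return n

def edit_similarity(generated: str, reference: str) -> float:
    """1 - normalized Levenshtein distance, computed by dynamic programming."""
    m, n = len(generated), len(reference)
    if max(m, n) == 0:
        return 1.0
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if generated[i - 1] == reference[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return 1.0 - prev[n] / max(m, n)
```
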

Disinformation metrics (disinformation_metrics)
- Self-BLEU
- Entropy (Monte Carlo)

Classification metrics (classification_metrics)
- Macro-F1
- Micro-F1

No Metric Group
- # tokens
- # bytes
- # ref
- cost
- # prompt chars
- # completion tokens
- Max output tokens
- # requests
- finish b/c length
- finish b/c stop
- finish b/c endoftext
- finish b/c unknown
- # completions
- Predicted index
- Exact match
- Quasi-exact match
- Prefix exact match
- Prefix quasi-exact match
- Exact match @5
- Quasi-exact match @5
- Log probability
- Log probability / byte
- Bits/byte
- Perplexity
- ROUGE-1
- ROUGE-2
- ROUGE-L
- BLEU-1
- BLEU-4
- F1 (set match)
- F1
- Absolute difference
- Geometric distance
- Valid fraction
- NDCG@10
- RR@10
- NDCG@20
- RR@20
- Equivalent
- Equivalent (chain of thought)
- Exact match (up to specified indicator)
- Exact match (at sets)
- Intersection over union (as sets)
- BERTScore (P)
- BERTScore (R)
- Correctness
- pass@1
- Address intended audience (human)
- Include intended goal (human)
- Correct style (human)
- Divisiveness (human)
- Toxicity (human)
- Support thesis (human)
- Style (human)
- Stereotypical associations (race, adjectives)
- Stereotypical associations (gender, adjectives)
- Average max toxicity
- Max toxicity
- Batch size
104 Metrics