Reasoning
Targeted evaluation of reasoning capabilities (e.g. mathematical, hierarchical).
Mean win rate
Synthetic reasoning (abstract symbols) - EM
Synthetic reasoning (natural language) - F1
bAbI - EM
Dyck - EM
GSM8K - EM
MATH - Equivalent
MATH (chain-of-thoughts) - Equivalent (chain of thought)
HumanEval (Code) - pass@1
LSAT - EM
LegalSupport - EM
Data imputation - EM
Entity matching - EM
Mean win rate
Synthetic reasoning (abstract symbols) - Denoised inference time (s)
Synthetic reasoning (natural language) - Denoised inference time (s)
bAbI - Denoised inference time (s)
Dyck - Denoised inference time (s)
GSM8K - Denoised inference time (s)
MATH - Denoised inference time (s)
MATH (chain-of-thoughts) - Denoised inference time (s)
APPS (Code) - Denoised inference time (s)
HumanEval (Code) - Denoised inference time (s)
LSAT - Denoised inference time (s)
LegalSupport - Denoised inference time (s)
Data imputation - Denoised inference time (s)
Entity matching - Denoised inference time (s)
Mean win rate
Synthetic reasoning (abstract symbols) - # eval, # train, truncated, # prompt tokens, # output tokens, # trials
Synthetic reasoning (natural language) - # eval, # train, truncated, # prompt tokens, # output tokens, # trials
bAbI - # eval, # train, truncated, # prompt tokens, # output tokens, # trials
Dyck - # eval, # train, truncated, # prompt tokens, # output tokens, # trials
GSM8K - # eval, # train, truncated, # prompt tokens, # output tokens, # trials
MATH - # eval, # train, truncated, # prompt tokens, # output tokens, # trials
MATH (chain-of-thoughts) - # eval, # train, truncated, # prompt tokens, # output tokens, # trials
APPS (Code) - # eval, # train, truncated, # prompt tokens, # output tokens, # trials
HumanEval (Code) - # eval, # train, truncated, # prompt tokens, # output tokens, # trials
LSAT - # eval, # train, truncated, # prompt tokens, # output tokens, # trials
LegalSupport - # eval, # train, truncated, # prompt tokens, # output tokens, # trials
Data imputation - # eval, # train, truncated, # prompt tokens, # output tokens, # trials
Entity matching - # eval, # train, truncated, # prompt tokens, # output tokens, # trials
Mean win rate
APPS (Code) - Avg. # tests passed
APPS (Code) - Strict correctness