Reasoning

Targeted evaluations of reasoning capabilities (e.g., mathematical and hierarchical reasoning).

  1. Mean win rate

  2. Synthetic reasoning (abstract symbols) - EM

  3. Synthetic reasoning (natural language) - F1

  4. bAbI - EM

  5. Dyck - EM

  6. GSM8K - EM

  7. MATH - Equivalent

  8. MATH (chain-of-thought) - Equivalent (chain of thought)

  9. HumanEval (Code) - pass@1

  10. LSAT - EM

  11. LegalSupport - EM

  12. Data imputation - EM

  13. Entity matching - EM
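
The first column, mean win rate, aggregates the per-scenario metrics listed above into a single number per model. The sketch below shows one plausible way to compute it, assuming that a model's win rate on a scenario is the fraction of other models it strictly outperforms and that the mean is taken over scenarios. The function name `mean_win_rate`, the tie handling (a tie does not count as a win), and all scores are hypothetical illustrations, not values or code from the leaderboard.

```python
from typing import Dict


def mean_win_rate(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Return the mean win rate per model.

    scores[model][scenario] holds a metric value where higher is better
    (e.g. EM, F1, pass@1). Assumption: ties do not count as wins.
    """
    models = list(scores)
    result: Dict[str, float] = {}
    for model in models:
        per_scenario_rates = []
        for scenario, value in scores[model].items():
            # Scores of every other model that was evaluated on this scenario.
            others = [
                scores[other][scenario]
                for other in models
                if other != model and scenario in scores[other]
            ]
            if not others:
                continue  # No one to compare against on this scenario.
            wins = sum(1 for other_value in others if value > other_value)
            per_scenario_rates.append(wins / len(others))
        # Average the head-to-head win fractions over scenarios.
        result[model] = (
            sum(per_scenario_rates) / len(per_scenario_rates)
            if per_scenario_rates
            else float("nan")
        )
    return result


# Made-up scores for illustration only; not real leaderboard results.
example_scores = {
    "model_a": {"GSM8K": 0.50, "Dyck": 0.80},
    "model_b": {"GSM8K": 0.30, "Dyck": 0.90},
    "model_c": {"GSM8K": 0.10, "Dyck": 0.70},
}
win_rates = mean_win_rate(example_scores)
print(win_rates)
```

Because the win rate depends only on rankings within each scenario, it lets columns with different metrics (EM, F1, pass@1) be combined without rescaling, at the cost of ignoring how large the score gaps are.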