BoolQ

The BoolQ benchmark for binary (yes/no) question answering (Clark et al., 2019).

  • Task: question answering
  • What: passages from Wikipedia, questions from search queries
  • When: 2010s
  • Who: web users
  • Language: English
  1. EM
  2. ECE (10-bin)
  3. EM (Robustness)
  4. EM (Fairness)
  5. Stereotypes (race)
  6. Stereotypes (gender)
  7. Representation (race)
  8. Representation (gender)
  9. Toxic fraction
  10. Denoised inference time (s)
  11. # eval
  12. # train
  13. truncated
  14. # prompt tokens
  15. # output tokens
  16. # trials
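
For readers unfamiliar with the first two columns, the following is a minimal sketch of how exact match (EM) and a 10-bin expected calibration error (ECE) could be computed for yes/no answers. The function names, the lowercase/strip normalization, and the use of equal-width confidence bins are all assumptions made for illustration; the benchmark's own implementation may normalize answers or bin confidences differently (for example, with equal-mass bins).

```python
from typing import Sequence


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that equal the reference answer.

    Assumption: answers are compared after stripping whitespace and lowercasing.
    """
    if len(predictions) != len(references) or not predictions:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(predictions)


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    num_bins: int = 10,
) -> float:
    """ECE with equal-width bins over [0, 1] (an assumed binning scheme).

    Each example lands in a bin according to its confidence; the result is the
    size-weighted average of |accuracy - mean confidence| over non-empty bins.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Toy data, invented purely for illustration.
em = exact_match(["Yes", "no", "yes"], ["yes", "yes", "yes"])
ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, True, True, False])
```

On the toy data, two of three predictions match their references, and the two occupied confidence bins each show a gap of about 0.1 between accuracy and mean confidence. Lower ECE indicates that a model's stated confidence tracks its accuracy more closely.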