CNN/DailyMail

The CNN/DailyMail benchmark for text summarization (Hermann et al., 2015; Nallapati et al., 2016).

  • Task: summarization
  • What: ?
  • When: ?
  • Who: ?
  • Language: English

Metrics and statistics reported:
  1. ROUGE-2

  2. SummaC

  3. QAFactEval

  4. BERTScore (F1)

  5. Coverage

  6. Density

  7. Compression

  8. HumanEval-faithfulness

  9. HumanEval-relevance

  10. HumanEval-coherence

  11. Stereotypes (race)

  12. Stereotypes (gender)

  13. Representation (race)

  14. Representation (gender)

  15. Toxic fraction

  16. Denoised inference time (s)

  17. # eval

  18. # train

  19. truncated

  20. # prompt tokens

  21. # output tokens

  22. # trials
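Several of the lexical metrics above have simple definitions. The following is a minimal sketch, not the official implementations used by the benchmark: ROUGE-2 computed as F1 over bigram overlap, and the extractive-fragment statistics Coverage, Density, and Compression in the style of Grusky et al. (2018). The function names and the assumption of pre-tokenized, whitespace-split inputs are ours.

```python
from collections import Counter

def rouge2_f1(reference, candidate):
    """Simplified ROUGE-2: F1 over clipped bigram overlap (tokens in, score out)."""
    def bigrams(tokens):
        return Counter(zip(tokens, tokens[1:]))
    ref, cand = bigrams(reference), bigrams(candidate)
    overlap = sum((ref & cand).values())  # clipped counts via multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def extractive_fragments(article, summary):
    """Greedily find maximal spans of the summary copied verbatim from the article."""
    frags, i = [], 0
    while i < len(summary):
        best = 0
        for j in range(len(article)):
            k = 0
            while (i + k < len(summary) and j + k < len(article)
                   and summary[i + k] == article[j + k]):
                k += 1
            best = max(best, k)
        if best:
            frags.append(best)  # record fragment length, skip past it
            i += best
        else:
            i += 1
    return frags

def coverage_density_compression(article, summary):
    frags = extractive_fragments(article, summary)
    n = len(summary)
    coverage = sum(frags) / n                 # fraction of summary tokens copied
    density = sum(f * f for f in frags) / n   # mean squared fragment length
    compression = len(article) / n            # article-to-summary length ratio
    return coverage, density, compression
```

For example, with `article = "the cat sat on the mat".split()` and `summary = "the cat sat".split()`, the summary is one copied fragment of length 3, so coverage is 1.0, density is 3.0, and compression is 2.0. Higher coverage and density indicate a more extractive summary; ROUGE-2 measures bigram agreement with the reference.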

[Figure: per-model results on CNN/DailyMail. Models shown: J1-Jumbo v1 (178B), J1-Large v1 (7.5B), J1-Grande v1 (17B), J1-Grande v2 beta (17B), Jurassic-2 Jumbo (178B), Jurassic-2 Grande (17B), Jurassic-2 Large (7.5B), Luminous Base (13B), Luminous Extended (30B), Luminous Supreme (70B), Anthropic-LM v4-s3 (52B), BLOOM (176B), T0pp (11B), Cohere xlarge v20220609 (52.4B), Cohere large v20220720 (13.1B), Cohere medium v20220720 (6.1B), Cohere small v20220720 (410M), Cohere xlarge v20221108 (52.4B), Cohere medium v20221108 (6.1B), Cohere Command beta (6.1B), Cohere Command beta (52.4B), GPT-J (6B), GPT-NeoX (20B), T5 (11B), UL2 (20B), OPT (175B), OPT (66B), TNLG v2 (530B), TNLG v2 (6.7B), davinci (175B), curie (6.7B), babbage (1.3B), ada (350M), text-davinci-003, text-davinci-002, text-curie-001, text-babbage-001, text-ada-001, gpt-3.5-turbo-0301, RedPajama-INCITE-Base-v1 (3B), GLM (130B), InstructPalmyra (30B), Palmyra X (43B), YaLM (100B).]