CNN/DailyMail

The CNN/DailyMail benchmark for text summarization (Hermann et al., 2015; Nallapati et al., 2016).

  • Task: summarization
  • What: ?
  • When: ?
  • Who: ?
  • Language: English

Metrics and statistics reported:
  1. ROUGE-2

  2. SummaC

  3. QAFactEval

  4. BERTScore (F1)

  5. Coverage

  6. Density

  7. Compression

  8. HumanEval-faithfulness

  9. HumanEval-relevance

  10. HumanEval-coherence

  11. Stereotypes (race)

  12. Stereotypes (gender)

  13. Representation (race)

  14. Representation (gender)

  15. Toxic fraction

  16. Denoised inference time (s)

  17. # eval

  18. # train

  19. truncated

  20. # prompt tokens

  21. # output tokens

  22. # trials
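Several of the lexical metrics above have simple definitions. The following is a minimal sketch, not the official implementations used by the benchmark: ROUGE-2 computed as F1 over bigram overlap, and the extractive-fragment statistics Coverage, Density, and Compression in the style of Grusky et al. (2018). The function names and the assumption of pre-tokenized, whitespace-split inputs are ours.

```python
from collections import Counter

def rouge2_f1(reference, candidate):
    """Simplified ROUGE-2: F1 over clipped bigram overlap (tokens in, score out)."""
    def bigrams(tokens):
        return Counter(zip(tokens, tokens[1:]))
    ref, cand = bigrams(reference), bigrams(candidate)
    overlap = sum((ref & cand).values())  # clipped counts via multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def extractive_fragments(article, summary):
    """Greedily find maximal spans of the summary copied verbatim from the article."""
    frags, i = [], 0
    while i < len(summary):
        best = 0
        for j in range(len(article)):
            k = 0
            while (i + k < len(summary) and j + k < len(article)
                   and summary[i + k] == article[j + k]):
                k += 1
            best = max(best, k)
        if best:
            frags.append(best)  # record fragment length, skip past it
            i += best
        else:
            i += 1
    return frags

def coverage_density_compression(article, summary):
    frags = extractive_fragments(article, summary)
    n = len(summary)
    coverage = sum(frags) / n                 # fraction of summary tokens copied
    density = sum(f * f for f in frags) / n   # mean squared fragment length
    compression = len(article) / n            # article-to-summary length ratio
    return coverage, density, compression
```

For example, with `article = "the cat sat on the mat".split()` and `summary = "the cat sat".split()`, the summary is one copied fragment of length 3, so coverage is 1.0, density is 3.0, and compression is 2.0. Higher coverage and density indicate a more extractive summary; ROUGE-2 measures bigram agreement with the reference.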

[Figure: per-model results on CNN/DailyMail. Models shown: J1-Jumbo v1 (178B), J1-Large v1 (7.5B), J1-Grande v1 (17B), J1-Grande v2 beta (17B), Jurassic-2 Jumbo (178B), Jurassic-2 Grande (17B), Jurassic-2 Large (7.5B), Luminous Base (13B), Luminous Extended (30B), Luminous Supreme (70B), Anthropic-LM v4-s3 (52B), BLOOM (176B), T0pp (11B), Cohere xlarge v20220609 (52.4B), Cohere large v20220720 (13.1B), Cohere medium v20220720 (6.1B), Cohere small v20220720 (410M), Cohere xlarge v20221108 (52.4B), Cohere medium v20221108 (6.1B), Cohere Command beta (6.1B), Cohere Command beta (52.4B), GPT-J (6B), GPT-NeoX (20B), T5 (11B), UL2 (20B), OPT (175B), OPT (66B), TNLG v2 (530B), TNLG v2 (6.7B), davinci (175B), curie (6.7B), babbage (1.3B), ada (350M), text-davinci-003, text-davinci-002, text-curie-001, text-babbage-001, text-ada-001, gpt-3.5-turbo-0301, RedPajama-INCITE-Base-v1 (3B), GLM (130B), InstructPalmyra (30B), Palmyra X (43B), YaLM (100B).]