The BoolQ benchmark for binary (yes/no) question answering [(Clark et al., 2019)](https://aclanthology.org/N19-1300/).

boolq

Quasi-exact match: Fraction of instances in which the predicted output matches a correct reference up to light processing.
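As a rough illustration of quasi-exact match, the sketch below normalizes both strings before comparing them. The exact "light processing" is not spelled out here, so the steps in `normalize` (lowercasing, dropping punctuation and articles, collapsing whitespace) are an assumption, and all function names are hypothetical.

```python
import re
import string


def normalize(text: str) -> str:
    """Assumed light processing: lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def quasi_exact_match(prediction: str, references: list[str]) -> float:
    """Return 1.0 if the normalized prediction equals any normalized reference, else 0.0."""
    pred = normalize(prediction)
    return float(any(pred == normalize(ref) for ref in references))


def quasi_exact_match_fraction(predictions: list[str], references: list[list[str]]) -> float:
    """Fraction of instances whose prediction matches one of its correct references."""
    scores = [quasi_exact_match(p, refs) for p, refs in zip(predictions, references)]
    return sum(scores) / len(scores)
```

For a yes/no task such as BoolQ, this means an output like " Yes." would count as matching the reference "yes" under the assumed normalization.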

10-bin expected calibration error: The absolute difference between the model's confidence and its accuracy, averaged across 10 bins that each contain an equal number of points (computed only for classification tasks). Warning: this metric is not reliable for small datasets (e.g., fewer than 300 examples), because each bin will contain very few examples.

Quasi-exact match: Fraction of instances in which the predicted output matches a correct reference up to light processing.
- Perturbation Robustness: Computes the worst case over different robustness perturbations (misspellings, formatting, contrast sets).
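One way to read "worst case over perturbations" is sketched below. It assumes that the worst case is taken per instance, over the original input and each perturbed copy, and that the per-instance results are then averaged; that aggregation order and the function name are assumptions, not something stated here.

```python
def worst_case_metric(instances):
    """instances: list of (original_score, [scores on perturbed copies]).

    Assumed aggregation: take the minimum score per instance, then average.
    """
    worst = [min([original, *perturbed]) for original, perturbed in instances]
    return sum(worst) / len(worst)
```

Under this reading an instance only counts as correct if the model answers correctly on the original and on every perturbed version, so the robustness score can never exceed the unperturbed score.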

Quasi-exact match: Fraction of instances in which the predicted output matches a correct reference up to light processing.
- Perturbation Fairness: Computes the worst case over different fairness perturbations (changing dialect, race of names, gender).

Stereotypical associations (race, profession): Measures uneven association of racial groups (Asian, Hispanic, White) with target professions. This measurement is based on co-occurrence statistics between racially associated names (word list from [Garg et al., 2018](https://www.pnas.org/doi/10.1073/pnas.1720347115); race associations based on US Census statistics) and the target professions (word list from [Bolukbasi et al., 2016](https://papers.nips.cc/paper/2016/hash/a486cd07e4ac3d270571622f4f316ec5-Abstract.html)).

Stereotypical associations (gender, profession): Measures uneven association of gender groups (male, female) with target professions. This measurement is based on co-occurrence statistics between the gender terms (word list from [Bolukbasi et al., 2016](https://papers.nips.cc/paper/2016/hash/a486cd07e4ac3d270571622f4f316ec5-Abstract.html)) and the target professions (word list from [Bolukbasi et al., 2016](https://papers.nips.cc/paper/2016/hash/a486cd07e4ac3d270571622f4f316ec5-Abstract.html)).
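A co-occurrence-based association score could look roughly like the sketch below. The tiny word lists are hypothetical stand-ins for the cited ones, and the scoring rule (total variation distance between each profession's group distribution and the uniform distribution, averaged over professions) is an assumption about how "uneven association" is quantified.

```python
from collections import Counter

# Hypothetical, heavily shortened word lists for illustration only.
GROUP_WORDS = {
    "male": {"he", "him", "his", "man"},
    "female": {"she", "her", "hers", "woman"},
}
PROFESSIONS = {"doctor", "nurse", "engineer"}


def association_bias(texts):
    """Average distance from uniform of each profession's group co-occurrence distribution."""
    cooccurrence = {profession: Counter() for profession in PROFESSIONS}
    for text in texts:
        tokens = text.lower().split()
        group_counts = {
            group: sum(token in words for token in tokens)
            for group, words in GROUP_WORDS.items()
        }
        for profession in PROFESSIONS:
            mentions = tokens.count(profession)
            for group, count in group_counts.items():
                cooccurrence[profession][group] += mentions * count
    uniform = 1.0 / len(GROUP_WORDS)
    distances = []
    for counts in cooccurrence.values():
        total = sum(counts.values())
        if total == 0:
            continue  # profession never co-occurred with any group word
        distances.append(
            0.5 * sum(abs(counts[group] / total - uniform) for group in GROUP_WORDS)
        )
    return sum(distances) / len(distances) if distances else None
```

A score of 0 means every mentioned profession co-occurs equally often with each group; with two groups the maximum is 0.5. Returning `None` when nothing co-occurs mirrors cells that have no value to report.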

Demographic representation (race): Measures uneven representation of racial groups (Asian, Hispanic, White). This measurement is based on disparities in the frequency statistics across racially-associated names (word list from [Garg et al., 2018](https://www.pnas.org/doi/10.1073/pnas.1720347115); race associations based on US Census statistics).

Demographic representation (gender): Measures uneven representation of gender groups (male, female). This measurement is based on disparities in the frequency statistics across gender terms (word list from [Bolukbasi et al., 2016](https://papers.nips.cc/paper/2016/hash/a486cd07e4ac3d270571622f4f316ec5-Abstract.html)).
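The frequency-based counterpart can be sketched in the same spirit. Here the term lists are again hypothetical, and measuring disparity as total variation distance from the uniform distribution is an assumption made for illustration.

```python
# Hypothetical, heavily shortened term lists for illustration only.
GENDER_TERMS = {
    "male": {"he", "him", "his", "man"},
    "female": {"she", "her", "hers", "woman"},
}


def representation_disparity(texts):
    """Distance from uniform of the group-term frequency distribution, or None if no terms occur."""
    counts = {group: 0 for group in GENDER_TERMS}
    for text in texts:
        for token in text.lower().split():
            for group, terms in GENDER_TERMS.items():
                if token in terms:
                    counts[group] += 1
    total = sum(counts.values())
    if total == 0:
        return None
    uniform = 1.0 / len(GENDER_TERMS)
    return 0.5 * sum(abs(count / total - uniform) for count in counts.values())
```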

Toxic fraction: Fraction of model outputs that are toxic (based on the PerspectiveAPI toxicity classifier).
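Given per-output toxicity scores that have already been obtained from a classifier, the fraction itself is a simple count. The 0.5 cut-off below is an assumed threshold, and the function name is hypothetical; no classifier API is called in this sketch.

```python
def toxic_fraction(toxicity_scores, threshold=0.5):
    """Fraction of outputs whose toxicity score is at or above the (assumed) threshold."""
    if not toxicity_scores:
        raise ValueError("at least one score is required")
    flagged = sum(1 for score in toxicity_scores if score >= threshold)
    return flagged / len(toxicity_scores)
```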

Denoised inference runtime (s): Average time to process a request to the model, with performance contention removed by using profiled runtimes from multiple trials of SyntheticEfficiencyScenario.

# eval: Number of evaluation instances.

# train: Number of training instances (e.g., in-context examples).

truncated: Fraction of instances where the prompt itself was truncated (implies that there were no in-context examples).

# prompt tokens: Number of tokens in the prompt.

# output tokens: Actual number of output tokens.

# trials: Number of trials, where in each trial we choose an independent, random set of training instances.
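Per-cell summaries of the form `min=…, mean=…, max=…, sum=… (n)` appear to aggregate one value per trial, with `n` the number of trials that produced a value. A minimal sketch of that aggregation, assuming three-decimal rounding with trailing zeros dropped (the function name is hypothetical):

```python
def summarize_trials(values, decimals=3):
    """Format per-trial values as 'min=…, mean=…, max=…, sum=… (n)'."""
    if not values:
        return "(0)"  # no trial produced a value for this cell

    def fmt(x):
        # Round to a fixed number of decimals, then drop trailing zeros and a bare dot.
        return f"{x:.{decimals}f}".rstrip("0").rstrip(".")

    total = sum(values)
    return (
        f"min={fmt(min(values))}, mean={fmt(total / len(values))}, "
        f"max={fmt(max(values))}, sum={fmt(total)} ({len(values)})"
    )
```

Because the sum is not an average, a cell with mean 0.5 over three trials shows a sum of 1.5; the value to compare across cells is the mean.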

min=0.766, mean=0.776, max=0.786, sum=2.327 (3)

min=0.205, mean=0.215, max=0.223, sum=0.646 (3)

min=0.635, mean=0.65, max=0.659, sum=1.949 (3)

min=0.693, mean=0.709, max=0.73, sum=2.128 (3)

min=0.55, mean=0.62, max=0.727, sum=1.859 (3)

min=1000, mean=1000, max=1000, sum=3000 (3)

min=506.985, mean=694.652, max=952.985, sum=2083.955 (3)

min=0.652, mean=0.683, max=0.709, sum=2.05 (3)

min=0.085, mean=0.106, max=0.133, sum=0.319 (3)

min=0.539, mean=0.567, max=0.603, sum=1.701 (3)

min=0.591, mean=0.622, max=0.651, sum=1.867 (3)

min=0.43, mean=0.485, max=0.566, sum=1.455 (3)

min=0.712, mean=0.722, max=0.733, sum=2.165 (3)

min=0.139, mean=0.154, max=0.169, sum=0.462 (3)

min=0.632, mean=0.643, max=0.658, sum=1.929 (3)

min=0.656, mean=0.678, max=0.695, sum=2.035 (3)

min=0.47, mean=0.535, max=0.624, sum=1.606 (3)

min=0.799, mean=0.812, max=0.823, sum=2.437 (3)

min=0.155, mean=0.167, max=0.185, sum=0.5 (3)

min=0.669, mean=0.692, max=0.714, sum=2.077 (3)

min=0.751, mean=0.764, max=0.784, sum=2.291 (3)

min=0.818, mean=0.829, max=0.838, sum=2.487 (3)

min=0.163, mean=0.175, max=0.198, sum=0.526 (3)

min=0.72, mean=0.729, max=0.736, sum=2.188 (3)

min=0.78, mean=0.792, max=0.798, sum=2.375 (3)

min=2, mean=2.002, max=2.003, sum=6.005 (3)

min=0.816, mean=0.826, max=0.832, sum=2.478 (3)

min=0.179, mean=0.209, max=0.243, sum=0.627 (3)

min=0.714, mean=0.729, max=0.743, sum=2.187 (3)

min=0.758, mean=0.78, max=0.791, sum=2.34 (3)

min=2.002, mean=2.002, max=2.002, sum=6.006 (3)

min=0.737, mean=0.742, max=0.747, sum=2.227 (3)

min=0.124, mean=0.147, max=0.165, sum=0.44 (3)

min=0.602, mean=0.607, max=0.615, sum=1.822 (3)

min=0.675, mean=0.685, max=0.697, sum=2.055 (3)

min=0.7, mean=0.719, max=0.74, sum=2.156 (3)

min=0.056, mean=0.066, max=0.084, sum=0.197 (3)

min=0.643, mean=0.655, max=0.673, sum=1.965 (3)

min=0.634, mean=0.653, max=0.682, sum=1.958 (3)

min=651.658, mean=908.991, max=1252.658, sum=2726.974 (3)

min=1, mean=1.002, max=1.003, sum=3.006 (3)

min=0.752, mean=0.767, max=0.794, sum=2.3 (3)

min=0.11, mean=0.129, max=0.154, sum=0.387 (3)

min=0.637, mean=0.659, max=0.7, sum=1.976 (3)

min=0.692, mean=0.711, max=0.733, sum=2.133 (3)

min=0.748, mean=0.775, max=0.795, sum=2.325 (3)

min=0.06, mean=0.083, max=0.111, sum=0.248 (3)

min=0.624, mean=0.665, max=0.693, sum=1.996 (3)

min=0.66, mean=0.694, max=0.713, sum=2.081 (3)

min=0.814, mean=0.815, max=0.816, sum=2.446 (3)

min=0.035, mean=0.038, max=0.041, sum=0.114 (3)

min=0.751, mean=0.756, max=0.76, sum=2.269 (3)

min=0.778, mean=0.782, max=0.788, sum=2.345 (3)

min=0.566, mean=0.637, max=0.75, sum=1.912 (3)

min=660.073, mean=908.406, max=1242.073, sum=2725.219 (3)

min=1.004, mean=1.004, max=1.004, sum=3.012 (3)

min=0.659, mean=0.704, max=0.728, sum=2.112 (3)

min=0.153, mean=0.209, max=0.247, sum=0.626 (3)

min=0.595, mean=0.642, max=0.674, sum=1.926 (3)

min=0.601, mean=0.656, max=0.693, sum=1.968 (3)

min=0.665, mean=0.853, max=1.05, sum=2.558 (3)

min=636.774, mean=897.107, max=1242.774, sum=2691.322 (3)

T0++ is explicitly trained on these datasets, i.e. data from the same distribution as the test set. See Table 5 on page 24 of https://arxiv.org/pdf/2110.08207.pdf.

min=0, mean=0, max=0, sum=0 (3)

min=0.208, mean=0.322, max=0.435, sum=0.967 (3)

(0)

min=0, mean=0.25, max=0.5, sum=0.5 (2)

min=0.366, mean=0.374, max=0.385, sum=1.121 (3)

min=1000, mean=1000, max=1000, sum=3000 (3)

min=2.027, mean=3.972, max=4.988, sum=11.915 (3)

min=479.758, mean=702.438, max=905.932, sum=2107.314 (3)

min=5, mean=5, max=5, sum=15 (3)

min=3, mean=3, max=3, sum=9 (3)

min=0.702, mean=0.718, max=0.74, sum=2.153 (3)

min=0.037, mean=0.04, max=0.043, sum=0.119 (3)

min=0.601, mean=0.614, max=0.622, sum=1.842 (3)

min=0.657, mean=0.667, max=0.681, sum=2 (3)

min=0.519, mean=0.598, max=0.705, sum=1.795 (3)

min=669.307, mean=925.307, max=1269.307, sum=2775.921 (3)

min=1, mean=1.001, max=1.004, sum=3.004 (3)

min=0.705, mean=0.725, max=0.738, sum=2.176 (3)

min=0.066, mean=0.088, max=0.106, sum=0.265 (3)

min=0.514, mean=0.545, max=0.566, sum=1.635 (3)

min=0.653, mean=0.676, max=0.695, sum=2.027 (3)

min=0.359, mean=0.421, max=0.505, sum=1.263 (3)

min=0.65, mean=0.659, max=0.667, sum=1.977 (3)

min=0.069, mean=0.082, max=0.093, sum=0.247 (3)

min=0.556, mean=0.562, max=0.573, sum=1.686 (3)

min=0.589, mean=0.597, max=0.61, sum=1.792 (3)

min=0.308, mean=0.35, max=0.402, sum=1.049 (3)

min=0.447, mean=0.457, max=0.464, sum=1.372 (3)

min=0.072, mean=0.095, max=0.124, sum=0.285 (3)

min=0.352, mean=0.361, max=0.378, sum=1.083 (3)

min=0.346, mean=0.374, max=0.396, sum=1.121 (3)

min=0.319, mean=0.367, max=0.436, sum=1.101 (3)

min=0.761, mean=0.762, max=0.763, sum=2.285 (3)

min=0.037, mean=0.051, max=0.062, sum=0.154 (3)

min=0.712, mean=0.718, max=0.722, sum=2.153 (3)

min=0.702, mean=0.708, max=0.72, sum=2.124 (3)

min=0.693, mean=0.7, max=0.704, sum=2.1 (3)

min=0.088, mean=0.095, max=0.105, sum=0.284 (3)

min=0.508, mean=0.54, max=0.568, sum=1.62 (3)

min=0.626, mean=0.642, max=0.652, sum=1.925 (3)

min=0.791, mean=0.798, max=0.809, sum=2.394 (3)

min=0.048, mean=0.059, max=0.069, sum=0.178 (3)

min=0.715, mean=0.725, max=0.743, sum=2.176 (3)

min=0.74, mean=0.748, max=0.76, sum=2.244 (3)

min=0.849, mean=0.856, max=0.86, sum=2.569 (3)

min=0.018, mean=0.023, max=0.026, sum=0.069 (3)

min=0.806, mean=0.811, max=0.816, sum=2.432 (3)

min=0.812, mean=0.822, max=0.827, sum=2.465 (3)

min=0.646, mean=0.649, max=0.65, sum=1.946 (3)

min=0.043, mean=0.062, max=0.086, sum=0.187 (3)

min=0.608, mean=0.621, max=0.631, sum=1.863 (3)

min=0.638, mean=0.639, max=0.64, sum=1.916 (3)

min=0.354, mean=0.499, max=0.575, sum=1.497 (3)

min=0.659, mean=0.683, max=0.714, sum=2.048 (3)

min=0.168, mean=0.195, max=0.238, sum=0.585 (3)

min=0.548, mean=0.551, max=0.556, sum=1.653 (3)

min=0.594, mean=0.609, max=0.629, sum=1.827 (3)

min=0.515, mean=0.773, max=1.206, sum=2.318 (3)

min=656.897, mean=913.897, max=1251.897, sum=2741.691 (3)

min=0.732, mean=0.761, max=0.803, sum=2.283 (3)

min=0.348, mean=0.433, max=0.512, sum=1.298 (3)

min=0.624, mean=0.65, max=0.688, sum=1.951 (3)

min=0.697, mean=0.723, max=0.766, sum=2.168 (3)

min=0.667, mean=0.667, max=0.667, sum=2 (3)

min=0.125, mean=0.375, max=0.5, sum=1.125 (3)

min=0.27, mean=0.271, max=0.272, sum=0.814 (3)

min=0.969, mean=1.588, max=2.006, sum=4.765 (3)

min=0.004, mean=0.004, max=0.004, sum=0.012 (3)

min=386.367, mean=401.944, max=422.649, sum=1205.833 (3)

min=0.717, mean=0.746, max=0.762, sum=2.237 (3)

min=0.416, mean=0.46, max=0.512, sum=1.379 (3)

min=0.638, mean=0.646, max=0.651, sum=1.938 (3)

min=0.672, mean=0.698, max=0.714, sum=2.095 (3)

min=0.167, mean=0.23, max=0.357, sum=0.69 (3)

min=0.001, mean=0.001, max=0.001, sum=0.003 (3)

min=0.292, mean=0.313, max=0.341, sum=0.938 (3)

min=0.953, mean=1.57, max=1.978, sum=4.709 (3)

min=386.826, mean=402.285, max=424.449, sum=1206.854 (3)

min=0.777, mean=0.793, max=0.813, sum=2.379 (3)

min=0.177, mean=0.194, max=0.218, sum=0.581 (3)

min=0.584, mean=0.623, max=0.662, sum=1.869 (3)

min=0.712, mean=0.731, max=0.746, sum=2.193 (3)

min=0.71, mean=0.869, max=0.954, sum=2.608 (3)

min=0.753, mean=0.76, max=0.764, sum=2.281 (3)

min=0.193, mean=0.2, max=0.206, sum=0.601 (3)

min=0.666, mean=0.683, max=0.701, sum=2.049 (3)

min=0.696, mean=0.71, max=0.721, sum=2.131 (3)

min=0.272, mean=0.834, max=1.907, sum=2.501 (3)

min=0.798, mean=0.809, max=0.829, sum=2.428 (3)

min=0.017, mean=0.048, max=0.088, sum=0.144 (3)

min=0.724, mean=0.733, max=0.747, sum=2.198 (3)

min=0.756, mean=0.767, max=0.777, sum=2.3 (3)

min=0.685, mean=0.698, max=0.709, sum=2.095 (3)

min=0.063, mean=0.065, max=0.067, sum=0.195 (3)

min=0.623, mean=0.638, max=0.653, sum=1.914 (3)

min=0.649, mean=0.665, max=0.674, sum=1.996 (3)

Brown et al. perform an analysis of the contamination for GPT-3 and its known derivatives. For these datasets, they find that 1% - 6% of the datasets' test instances are contaminated based on N-gram overlap, and model performance does not substantially change for these datasets. See Table C.1 on page 45 of https://arxiv.org/pdf/2005.14165.pdf.
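An N-gram overlap check of the kind mentioned above can be sketched as follows. This is a simplified illustration rather than Brown et al.'s actual pipeline: whitespace tokenization, lowercasing, the default `n`, and all function names are assumptions.

```python
def ngrams(tokens, n):
    """Set of all contiguous n-token windows (empty if the text is shorter than n)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_fraction(test_texts, training_texts, n=13):
    """Fraction of test instances sharing at least one n-gram with the training texts."""
    training_ngrams = set()
    for text in training_texts:
        training_ngrams |= ngrams(text.lower().split(), n)
    flagged = sum(
        1
        for text in test_texts
        if not ngrams(text.lower().split(), n).isdisjoint(training_ngrams)
    )
    return flagged / len(test_texts)
```

A check like this flags surface overlap only, which is why it is paired with a comparison of model performance on flagged versus clean instances.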

min=0.679, mean=0.722, max=0.77, sum=2.167 (3)

min=0.047, mean=0.072, max=0.103, sum=0.215 (3)

min=0.592, mean=0.639, max=0.677, sum=1.918 (3)

min=0.635, mean=0.682, max=0.729, sum=2.046 (3)

(0)

min=0, mean=0, max=0, sum=0 (3)

min=0.204, mean=0.21, max=0.217, sum=0.631 (3)

min=1000, mean=1000, max=1000, sum=3000 (3)

min=5, mean=5, max=5, sum=15 (3)

min=660.073, mean=908.406, max=1242.073, sum=2725.219 (3)

min=1, mean=1, max=1, sum=3 (3)

min=3, mean=3, max=3, sum=9 (3)

min=0.597, mean=0.656, max=0.704, sum=1.969 (3)

min=0.051, mean=0.079, max=0.115, sum=0.236 (3)

min=0.484, mean=0.545, max=0.599, sum=1.635 (3)

min=0.535, mean=0.594, max=0.631, sum=1.782 (3)

min=0.096, mean=0.1, max=0.104, sum=0.3 (3)

min=0.52, mean=0.574, max=0.623, sum=1.723 (3)

min=0.036, mean=0.068, max=0.089, sum=0.203 (3)

min=0.432, mean=0.477, max=0.522, sum=1.431 (3)

min=0.404, mean=0.436, max=0.457, sum=1.307 (3)

min=0.119, mean=0.121, max=0.125, sum=0.364 (3)

min=0.525, mean=0.581, max=0.627, sum=1.743 (3)