The OpenbookQA benchmark for commonsense-intensive open book question answering [(Mihaylov et al., 2018)](https://aclanthology.org/D18-1260/).

openbookqa

Exact match: Fraction of instances where the predicted output exactly matches a correct reference.
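
As a rough illustration of the exact-match definition, the sketch below scores a handful of predictions against their references. The function names and the whitespace stripping are assumptions made for this example; the benchmark's own normalization may differ.

```python
from typing import Sequence


def exact_match(prediction: str, references: Sequence[str]) -> float:
    """Return 1.0 if the prediction equals any correct reference, else 0.0."""
    # Assumption: surrounding whitespace is ignored before comparing.
    pred = prediction.strip()
    return 1.0 if any(pred == ref.strip() for ref in references) else 0.0


def exact_match_score(predictions: Sequence[str],
                      references: Sequence[Sequence[str]]) -> float:
    """Fraction of instances whose prediction exactly matches a reference."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("need equal-length, non-empty inputs")
    hits = [exact_match(p, refs) for p, refs in zip(predictions, references)]
    return sum(hits) / len(hits)


# Hypothetical multiple-choice answers: three of four predictions match.
score = exact_match_score(["B", "C", "A", "D"], [["B"], ["A"], ["A"], ["D"]])
print(score)  # 0.75
```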

10-bin expected calibration error: The absolute difference between the model's confidence and its accuracy, averaged across 10 bins that each contain an equal number of points (only computed for classification tasks). Warning: not reliable for small datasets (e.g., with < 300 examples) because each bin will have very few examples.
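
One possible reading of this definition, sketched in plain Python: sort instances by confidence, split them into 10 equal-count bins, and average the per-bin gap between mean confidence and accuracy. The function name, the size-weighted average, and the tie handling are assumptions for illustration, not the benchmark's actual implementation.

```python
from typing import Sequence


def equal_mass_ece(confidences: Sequence[float],
                   correct: Sequence[bool],
                   num_bins: int = 10) -> float:
    """Expected calibration error with equal-mass (equal-count) bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equal-length, non-empty inputs")
    # Sort by confidence so each bin covers a contiguous confidence range.
    pairs = sorted(zip(confidences, correct), key=lambda p: p[0])
    n = len(pairs)
    ece = 0.0
    for b in range(num_bins):
        # Integer boundaries give bins whose sizes differ by at most one.
        lo, hi = b * n // num_bins, (b + 1) * n // num_bins
        bin_pairs = pairs[lo:hi]
        if not bin_pairs:
            continue  # happens when there are fewer points than bins
        avg_conf = sum(c for c, _ in bin_pairs) / len(bin_pairs)
        accuracy = sum(1 for _, ok in bin_pairs if ok) / len(bin_pairs)
        # Weight each bin's gap by the share of points it holds.
        ece += (len(bin_pairs) / n) * abs(avg_conf - accuracy)
    return ece
```

With only a few hundred examples, each of the 10 bins holds a few dozen points at most, so the per-bin accuracy estimate is noisy; that is the small-dataset caveat in the definition.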

Exact match: Fraction of instances where the predicted output exactly matches a correct reference.
- Perturbation Robustness: Computes the worst case over different robustness perturbations (misspellings, formatting, contrast sets).
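
A minimal sketch of a worst-case aggregation, assuming the worst case is taken per instance (the minimum score over the original input and its perturbed variants) and then averaged over instances. The function name, data layout, and scores below are hypothetical.

```python
from typing import Dict, List


def worst_case_score(per_instance_scores: Dict[str, List[float]]) -> float:
    """Average, over instances, of the minimum score across input variants."""
    if not per_instance_scores:
        raise ValueError("no instances to score")
    # An instance only counts as correct if every variant is answered correctly.
    worst = [min(scores) for scores in per_instance_scores.values()]
    return sum(worst) / len(worst)


# Hypothetical exact-match scores: [original, misspelled, reformatted].
scores = {
    "q1": [1.0, 1.0, 1.0],
    "q2": [1.0, 0.0, 1.0],  # fails under one perturbation
    "q3": [0.0, 0.0, 0.0],
    "q4": [1.0, 1.0, 1.0],
}
print(worst_case_score(scores))  # 0.5
```

Under this reading the robustness score can never exceed the unperturbed exact-match score, since the original input is one of the variants.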

Exact match: Fraction of instances where the predicted output exactly matches a correct reference.
- Perturbation Fairness: Computes the worst case over different fairness perturbations (changing dialect, race of names, gender).

Denoised inference runtime (s): Average time to process a request to the model, with performance contention removed by using profiled runtimes from multiple trials of SyntheticEfficiencyScenario.

# eval: Number of evaluation instances.

# train: Number of training instances (e.g., in-context examples).

truncated: Fraction of instances where the prompt itself was truncated (implies that there were no in-context examples).

# prompt tokens: Number of tokens in the prompt.

# output tokens: Actual number of output tokens.

# trials: Number of trials, where in each trial we choose an independent, random set of training instances.

openbookqa_dataset:openbookqa

min=0.534, mean=0.534, max=0.534, sum=0.534 (1)

min=0.25, mean=0.25, max=0.25, sum=0.25 (1)

min=0.43, mean=0.43, max=0.43, sum=0.43 (1)

min=0.466, mean=0.466, max=0.466, sum=0.466 (1)

min=0.259, mean=0.259, max=0.259, sum=0.259 (1)

min=4.348, mean=4.348, max=4.348, sum=4.348 (1)

min=0.514, mean=0.514, max=0.514, sum=0.514 (1)

min=0.412, mean=0.412, max=0.412, sum=0.412 (1)

min=0.444, mean=0.444, max=0.444, sum=0.444 (1)

min=0.238, mean=0.238, max=0.238, sum=0.238 (1)

min=0.52, mean=0.52, max=0.52, sum=0.52 (1)

min=0.258, mean=0.258, max=0.258, sum=0.258 (1)

min=0.424, mean=0.424, max=0.424, sum=0.424 (1)

min=0.472, mean=0.472, max=0.472, sum=0.472 (1)

min=0.281, mean=0.281, max=0.281, sum=0.281 (1)

min=0.56, mean=0.56, max=0.56, sum=0.56 (1)

min=0.215, mean=0.215, max=0.215, sum=0.215 (1)

min=0.474, mean=0.474, max=0.474, sum=0.474 (1)

min=0.478, mean=0.478, max=0.478, sum=0.478 (1)

min=0.558, mean=0.558, max=0.558, sum=0.558 (1)

min=0.237, mean=0.237, max=0.237, sum=0.237 (1)

min=0.47, mean=0.47, max=0.47, sum=0.47 (1)

min=0.488, mean=0.488, max=0.488, sum=0.488 (1)

min=0.542, mean=0.542, max=0.542, sum=0.542 (1)

min=0.53, mean=0.53, max=0.53, sum=0.53 (1)

min=0.448, mean=0.448, max=0.448, sum=0.448 (1)

min=0.45, mean=0.45, max=0.45, sum=0.45 (1)

min=0.244, mean=0.244, max=0.244, sum=0.244 (1)

min=0.482, mean=0.482, max=0.482, sum=0.482 (1)

min=0.447, mean=0.447, max=0.447, sum=0.447 (1)

min=5.27, mean=5.27, max=5.27, sum=5.27 (1)

min=0.132, mean=0.132, max=0.132, sum=0.132 (1)

min=0.248, mean=0.248, max=0.248, sum=0.248 (1)

min=0.438, mean=0.438, max=0.438, sum=0.438 (1)

min=0.032, mean=0.032, max=0.032, sum=0.032 (1)

min=5.444, mean=5.444, max=5.444, sum=5.444 (1)

min=0.55, mean=0.55, max=0.55, sum=0.55 (1)

min=0.235, mean=0.235, max=0.235, sum=0.235 (1)

min=0.314, mean=0.314, max=0.314, sum=0.314 (1)

min=5.358, mean=5.358, max=5.358, sum=5.358 (1)

min=0.225, mean=0.225, max=0.225, sum=0.225 (1)

min=0.446, mean=0.446, max=0.446, sum=0.446 (1)

min=0.201, mean=0.201, max=0.201, sum=0.201 (1)

min=0.496, mean=0.496, max=0.496, sum=0.496 (1)

min=0.275, mean=0.275, max=0.275, sum=0.275 (1)

min=0.382, mean=0.382, max=0.382, sum=0.382 (1)

min=0.42, mean=0.42, max=0.42, sum=0.42 (1)

min=0.187, mean=0.187, max=0.187, sum=0.187 (1)

min=0.348, mean=0.348, max=0.348, sum=0.348 (1)

min=0.379, mean=0.379, max=0.379, sum=0.379 (1)

min=0.28, mean=0.28, max=0.28, sum=0.28 (1)

min=0.214, mean=0.214, max=0.214, sum=0.214 (1)

min=0.588, mean=0.588, max=0.588, sum=0.588 (1)

min=0.207, mean=0.207, max=0.207, sum=0.207 (1)

min=0.538, mean=0.538, max=0.538, sum=0.538 (1)

min=0.23, mean=0.23, max=0.23, sum=0.23 (1)

min=0.414, mean=0.414, max=0.414, sum=0.414 (1)

min=0.44, mean=0.44, max=0.44, sum=0.44 (1)

min=0.468, mean=0.468, max=0.468, sum=0.468 (1)

min=0.582, mean=0.582, max=0.582, sum=0.582 (1)

min=0.231, mean=0.231, max=0.231, sum=0.231 (1)

min=0.492, mean=0.492, max=0.492, sum=0.492 (1)

min=0.508, mean=0.508, max=0.508, sum=0.508 (1)

min=0.398, mean=0.398, max=0.398, sum=0.398 (1)

min=0.416, mean=0.416, max=0.416, sum=0.416 (1)

min=0.019, mean=0.019, max=0.019, sum=0.019 (1)

min=0.524, mean=0.524, max=0.524, sum=0.524 (1)

min=0.232, mean=0.232, max=0.232, sum=0.232 (1)

min=0.024, mean=0.024, max=0.024, sum=0.024 (1)

min=5.346, mean=5.346, max=5.346, sum=5.346 (1)

min=0.586, mean=0.586, max=0.586, sum=0.586 (1)

min=0.209, mean=0.209, max=0.209, sum=0.209 (1)

min=0.038, mean=0.038, max=0.038, sum=0.038 (1)

min=0.454, mean=0.454, max=0.454, sum=0.454 (1)

min=0.188, mean=0.188, max=0.188, sum=0.188 (1)

min=0.562, mean=0.562, max=0.562, sum=0.562 (1)

min=0.243, mean=0.243, max=0.243, sum=0.243 (1)

min=0.476, mean=0.476, max=0.476, sum=0.476 (1)

min=0.504, mean=0.504, max=0.504, sum=0.504 (1)

min=0.282, mean=0.282, max=0.282, sum=0.282 (1)

min=0.408, mean=0.408, max=0.408, sum=0.408 (1)

⚠ Contamination warning for the values that follow: Brown et al. analyze contamination for GPT-3 and its known derivatives. For these datasets, they find that 1% to 6% of the datasets' test instances are contaminated based on N-gram overlap, and that model performance does not change substantially. See Table C.1 on page 45 of https://arxiv.org/pdf/2005.14165.pdf.

min=0.586, mean=0.586, max=0.586, sum=0.586 (1)

min=0.204, mean=0.204, max=0.204, sum=0.204 (1)

min=0.474, mean=0.474, max=0.474, sum=0.474 (1)

min=0.502, mean=0.502, max=0.502, sum=0.502 (1)

min=0.184, mean=0.184, max=0.184, sum=0.184 (1)

min=500, mean=500, max=500, sum=500 (1)

min=0, mean=0, max=0, sum=0 (1)

min=5.27, mean=5.27, max=5.27, sum=5.27 (1)

min=1, mean=1, max=1, sum=1 (1)

min=0.26, mean=0.26, max=0.26, sum=0.26 (1)

min=0.396, mean=0.396, max=0.396, sum=0.396 (1)

min=0.43, mean=0.43, max=0.43, sum=0.43 (1)

min=0.079, mean=0.079, max=0.079, sum=0.079 (1)

min=0.438, mean=0.438, max=0.438, sum=0.438 (1)

min=0.3, mean=0.3, max=0.3, sum=0.3 (1)

min=0.314, mean=0.314, max=0.314, sum=0.314 (1)

min=0.326, mean=0.326, max=0.326, sum=0.326 (1)

min=0.111, mean=0.111, max=0.111, sum=0.111 (1)

min=0.38, mean=0.38, max=0.38, sum=0.38 (1)

min=0.346, mean=0.346, max=0.346, sum=0.346 (1)

min=0.27, mean=0.27, max=0.27, sum=0.27 (1)

min=0.318, mean=0.318, max=0.318, sum=0.318 (1)

min=0.136, mean=0.136, max=0.136, sum=0.136 (1)

min=0.646, mean=0.646, max=0.646, sum=0.646 (1)

min=0.216, mean=0.216, max=0.216, sum=0.216 (1)

min=0.572, mean=0.572, max=0.572, sum=0.572 (1)

min=0.578, mean=0.578, max=0.578, sum=0.578 (1)

min=0.594, mean=0.594, max=0.594, sum=0.594 (1)

min=0.238, mean=0.238, max=0.238, sum=0.238 (1)

min=0.52, mean=0.52, max=0.52, sum=0.52 (1)

min=0.54, mean=0.54, max=0.54, sum=0.54 (1)

min=0.158, mean=0.158, max=0.158, sum=0.158 (1)

min=0.514, mean=0.514, max=0.514, sum=0.514 (1)

min=0.321, mean=0.321, max=0.321, sum=0.321 (1)

min=0.424, mean=0.424, max=0.424, sum=0.424 (1)

min=0.452, mean=0.452, max=0.452, sum=0.452 (1)

min=0.119, mean=0.119, max=0.119, sum=0.119 (1)

min=0.362, mean=0.362, max=0.362, sum=0.362 (1)

min=0.39, mean=0.39, max=0.39, sum=0.39 (1)

min=0.386, mean=0.386, max=0.386, sum=0.386 (1)

min=0.122, mean=0.122, max=0.122, sum=0.122 (1)

min=0.487, mean=0.487, max=0.487, sum=0.487 (1)

min=0.248, mean=0.248, max=0.248, sum=0.248 (1)

min=0.266, mean=0.266, max=0.266, sum=0.266 (1)

min=0.076, mean=0.076, max=0.076, sum=0.076 (1)