Targeted evaluation of knowledge (e.g. factual, cultural, commonsense).

knowledge

How many models this model outperforms on average (over columns).
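As a rough illustration of the win-rate idea described above, the following sketch computes, for each column, the fraction of other models that a given model beats, and then averages those fractions over columns. Reading "how many" as a fraction, treating ties as non-wins, and the model names and scores used here are all assumptions made for illustration, not details taken from the page.

```python
from statistics import mean


def mean_win_rate(scores, model):
    """Average, over columns, of the fraction of other models that `model` beats.

    `scores` maps a model name to a dict of {column name: score}.
    Ties count as non-wins (an assumption).
    """
    rates = []
    for column, own_score in scores[model].items():
        # Compare only against other models that report this column.
        others = [
            cols[column]
            for name, cols in scores.items()
            if name != model and column in cols
        ]
        if not others:
            continue
        wins = sum(own_score > other for other in others)
        rates.append(wins / len(others))
    return mean(rates) if rates else 0.0


# Hypothetical scores, not taken from the table.
scores = {
    "model_a": {"qa": 0.30, "mc": 0.80},
    "model_b": {"qa": 0.20, "mc": 0.70},
    "model_c": {"qa": 0.40, "mc": 0.75},
}
print(mean_win_rate(scores, "model_a"))
```

Here `model_a` beats one of two models on `qa` and both on `mc`, so its win rate under these assumptions is the average of 0.5 and 1.0.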

The NaturalQuestions [(Kwiatkowski et al., 2019)](https://aclanthology.org/Q19-1026/) benchmark for question answering based on naturally occurring queries issued to Google Search. The input does not include the Wikipedia page with the answer.

F1: Average F1 score in terms of word overlap between the model output and a correct reference.
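A word-overlap F1 of this kind can be sketched as follows. The tokenization (lowercasing and splitting on whitespace) and the choice to take the best score over several references are assumptions; the page does not specify either.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of word-level precision and recall."""
    # Assumed tokenization: lowercase, split on whitespace.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared word at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, references: list[str]) -> float:
    """Score against the closest of several correct references (assumed)."""
    return max(token_f1(prediction, ref) for ref in references)
```

For example, comparing "the cat sat" with the reference "the cat" gives precision 2/3 and recall 1, so F1 is 0.8. The reported figure would then be the average of such per-instance scores.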

The HellaSwag benchmark for commonsense reasoning in question answering [(Zellers et al., 2019)](https://aclanthology.org/P19-1472/).

Exact match: Fraction of instances in which the predicted output matches a correct reference exactly.
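A minimal sketch of this metric, assuming strict string equality and that an instance may have several correct references (how whitespace is handled is not stated on the page):

```python
def exact_match(prediction: str, references: list[str]) -> bool:
    """True if the prediction equals any correct reference character for character."""
    return any(prediction == ref for ref in references)


def exact_match_rate(predictions: list[str], references_per_instance: list[list[str]]) -> float:
    """Fraction of instances whose prediction exactly matches a correct reference."""
    if not predictions:
        return 0.0
    matches = [
        exact_match(pred, refs)
        for pred, refs in zip(predictions, references_per_instance)
    ]
    return sum(matches) / len(matches)


# Hypothetical multiple-choice outputs: two of three are correct.
rate = exact_match_rate(["B", "A", "C"], [["B"], ["C"], ["C", "D"]])
print(rate)
```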

The OpenbookQA benchmark for commonsense-intensive open book question answering [(Mihaylov et al., 2018)](https://aclanthology.org/D18-1260/).

Exact match: Fraction of instances in which the predicted output matches a correct reference exactly.

The TruthfulQA benchmark for measuring model truthfulness and commonsense knowledge in question answering [(Lin et al., 2022)](https://aclanthology.org/2022.acl-long.229/).

Exact match: Fraction of instances in which the predicted output matches a correct reference exactly.

The Massive Multitask Language Understanding (MMLU) benchmark for knowledge-intensive question answering across 57 domains [(Hendrycks et al., 2021)](https://openreview.net/forum?id=d7KBjmI3GmQ).

Exact match: Fraction of instances in which the predicted output matches a correct reference exactly.

Scenario introduced in this work, inspired by [Petroni et al. (2019)](https://aclanthology.org/D19-1250/), to more extensively test factual knowledge.

Quasi-exact match: Fraction of instances in which the predicted output matches a correct reference up to light processing.
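The page does not say what "light processing" covers. The sketch below assumes one common choice: lowercasing, removing punctuation and English articles, and collapsing whitespace before comparing. The `normalize` helper is hypothetical.

```python
import re
import string


def normalize(text: str) -> str:
    """Assumed light processing: lowercase, drop punctuation and articles, squeeze spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def quasi_exact_match(prediction: str, references: list[str]) -> bool:
    """True if the prediction equals any reference after normalization."""
    normalized = normalize(prediction)
    return any(normalized == normalize(ref) for ref in references)


print(quasi_exact_match("The Eiffel Tower.", ["eiffel tower"]))
```

Under these assumptions "The Eiffel Tower." and "eiffel tower" count as a match, whereas strict exact match would reject the pair.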

accuracy
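Each cell that follows has the form `min=…, mean=…, max=…, sum=… (n)`. The values suggest that `n` is the number of aggregated runs and that `mean` is `sum / n` (for example, 0.879 / 3 = 0.293); that reading is an inference from the numbers, not something the page states. A small sketch for parsing a cell and checking this relationship, with hypothetical names throughout:

```python
import re
from typing import NamedTuple


class CellStats(NamedTuple):
    minimum: float
    mean: float
    maximum: float
    total: float
    count: int


CELL_PATTERN = re.compile(
    r"min=(?P<min>[\d.]+), mean=(?P<mean>[\d.]+), max=(?P<max>[\d.]+), "
    r"sum=(?P<sum>[\d.]+) \((?P<count>\d+)\)"
)


def parse_cell(cell: str) -> CellStats:
    """Parse one 'min=..., mean=..., max=..., sum=... (n)' cell."""
    match = CELL_PATTERN.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unrecognized cell format: {cell!r}")
    return CellStats(
        float(match["min"]),
        float(match["mean"]),
        float(match["max"]),
        float(match["sum"]),
        int(match["count"]),
    )


def is_consistent(stats: CellStats, tolerance: float = 0.001) -> bool:
    """Check min <= mean <= max and mean == sum / n, allowing for rounding."""
    ordered = stats.minimum <= stats.mean <= stats.maximum
    return ordered and abs(stats.total / stats.count - stats.mean) <= tolerance


stats = parse_cell("min=0.288, mean=0.293, max=0.302, sum=0.879 (3)")
print(stats.count, is_consistent(stats))
```

The tolerance allows for the three-decimal rounding visible in the cells.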

min=0.288, mean=0.293, max=0.302, sum=0.879 (3)

min=0.765, mean=0.765, max=0.765, sum=0.765 (1)

min=0.534, mean=0.534, max=0.534, sum=0.534 (1)

min=0.157, mean=0.175, max=0.187, sum=0.524 (3)

min=0.19, mean=0.259, max=0.35, sum=3.891 (15)

min=0.055, mean=0.28, max=0.513, sum=8.404 (30)

min=0.182, mean=0.19, max=0.196, sum=0.571 (3)

min=0.514, mean=0.514, max=0.514, sum=0.514 (1)

min=0.19, mean=0.197, max=0.2, sum=0.59 (3)

min=0.2, mean=0.241, max=0.298, sum=3.617 (15)

min=0.051, mean=0.226, max=0.479, sum=6.769 (30)

min=0.229, mean=0.233, max=0.239, sum=0.7 (3)

min=0.739, mean=0.739, max=0.739, sum=0.739 (1)

min=0.52, mean=0.52, max=0.52, sum=0.52 (1)

min=0.171, mean=0.193, max=0.217, sum=0.58 (3)

min=0.2, mean=0.27, max=0.35, sum=4.047 (15)

min=0.044, mean=0.269, max=0.531, sum=8.061 (30)

min=0.332, mean=0.337, max=0.341, sum=1.012 (3)

min=0.764, mean=0.764, max=0.764, sum=0.764 (1)

min=0.56, mean=0.56, max=0.56, sum=0.56 (1)

min=0.266, mean=0.306, max=0.333, sum=0.917 (3)

min=0.23, mean=0.445, max=0.8, sum=6.677 (15)

min=0.058, mean=0.313, max=0.594, sum=9.403 (30)

min=0.384, mean=0.385, max=0.387, sum=1.156 (3)

min=0.788, mean=0.788, max=0.788, sum=0.788 (1)

min=0.558, mean=0.558, max=0.558, sum=0.558 (1)

min=0.367, mean=0.437, max=0.485, sum=1.312 (3)

min=0.23, mean=0.48, max=0.83, sum=7.207 (15)

min=0.005, mean=0.154, max=0.521, sum=4.617 (30)

min=0.35, mean=0.356, max=0.362, sum=1.068 (3)

min=0.781, mean=0.781, max=0.781, sum=0.781 (1)

min=0.542, mean=0.542, max=0.542, sum=0.542 (1)

min=0.287, mean=0.348, max=0.384, sum=1.043 (3)

min=0.24, mean=0.475, max=0.81, sum=7.13 (15)

min=0.001, mean=0.129, max=0.531, sum=3.856 (30)

min=0.269, mean=0.274, max=0.28, sum=0.821 (3)

min=0.729, mean=0.729, max=0.729, sum=0.729 (1)

min=0.53, mean=0.53, max=0.53, sum=0.53 (1)

min=0.22, mean=0.245, max=0.283, sum=0.734 (3)

min=0.211, mean=0.339, max=0.5, sum=5.078 (15)

min=0.001, mean=0.128, max=0.479, sum=3.833 (30)

min=0.197, mean=0.202, max=0.206, sum=0.606 (3)

min=0.165, mean=0.182, max=0.194, sum=0.547 (3)

min=0.193, mean=0.27, max=0.32, sum=4.045 (15)

min=0.072, mean=0.275, max=0.531, sum=8.258 (30)

min=0.252, mean=0.254, max=0.257, sum=0.762 (3)

min=0.208, mean=0.221, max=0.231, sum=0.662 (3)

min=0.23, mean=0.321, max=0.49, sum=4.811 (15)

min=0.091, mean=0.308, max=0.531, sum=9.242 (30)

min=0.281, mean=0.293, max=0.299, sum=0.878 (3)

min=0.2, mean=0.222, max=0.258, sum=0.667 (3)

min=0.22, mean=0.38, max=0.61, sum=5.702 (15)

min=0.064, mean=0.335, max=0.625, sum=10.036 (30)

min=0.279, mean=0.288, max=0.295, sum=0.863 (3)

min=0.807, mean=0.807, max=0.807, sum=0.807 (1)

min=0.298, mean=0.368, max=0.408, sum=1.472 (4)

min=0.25, mean=0.481, max=0.78, sum=7.22 (15)

min=0.109, mean=0.336, max=0.615, sum=10.093 (30)

min=0.208, mean=0.216, max=0.221, sum=0.648 (3)

min=0.744, mean=0.744, max=0.744, sum=0.744 (1)

min=0.197, mean=0.205, max=0.211, sum=0.82 (4)

min=0.19, mean=0.299, max=0.42, sum=4.481 (15)

min=0.041, mean=0.221, max=0.479, sum=6.64 (30)

min=0.038, mean=0.039, max=0.04, sum=0.117 (3)

min=0.347, mean=0.377, max=0.411, sum=1.508 (4)

min=0.25, mean=0.407, max=0.67, sum=6.098 (15)

min=0, mean=0.013, max=0.052, sum=0.383 (30)

min=0.302, mean=0.312, max=0.32, sum=0.937 (3)

min=0.811, mean=0.811, max=0.811, sum=0.811 (1)

min=0.55, mean=0.55, max=0.55, sum=0.55 (1)

min=0.177, mean=0.198, max=0.225, sum=0.593 (3)

min=0.228, mean=0.353, max=0.56, sum=5.296 (15)

min=0.098, mean=0.336, max=0.598, sum=10.065 (30)

min=0.227, mean=0.232, max=0.235, sum=0.695 (3)

min=0.736, mean=0.736, max=0.736, sum=0.736 (1)

min=0.161, mean=0.181, max=0.2, sum=0.544 (3)

min=0.19, mean=0.324, max=0.4, sum=4.854 (15)

min=0.067, mean=0.286, max=0.542, sum=8.566 (30)

min=0.173, mean=0.177, max=0.179, sum=0.53 (3)

min=0.706, mean=0.706, max=0.706, sum=0.706 (1)

min=0.496, mean=0.496, max=0.496, sum=0.496 (1)

min=0.176, mean=0.19, max=0.203, sum=0.57 (3)

min=0.18, mean=0.279, max=0.36, sum=4.182 (15)

min=0.056, mean=0.254, max=0.5, sum=7.622 (30)

min=0.072, mean=0.078, max=0.082, sum=0.234 (3)

min=0.483, mean=0.483, max=0.483, sum=0.483 (1)

min=0.348, mean=0.348, max=0.348, sum=0.348 (1)

min=0.202, mean=0.217, max=0.226, sum=0.65 (3)

min=0.18, mean=0.264, max=0.42, sum=3.963 (15)

min=0.006, mean=0.141, max=0.479, sum=4.238 (30)

min=0.359, mean=0.361, max=0.365, sum=1.083 (3)

min=0.81, mean=0.81, max=0.81, sum=0.81 (1)

min=0.588, mean=0.588, max=0.588, sum=0.588 (1)

min=0.164, mean=0.169, max=0.179, sum=0.508 (3)

min=0.21, mean=0.382, max=0.67, sum=5.731 (15)

min=0.105, mean=0.342, max=0.635, sum=10.266 (30)

min=0.193, mean=0.199, max=0.203, sum=0.598 (3)

min=0.726, mean=0.726, max=0.726, sum=0.726 (1)

min=0.538, mean=0.538, max=0.538, sum=0.538 (1)

min=0.19, mean=0.215, max=0.237, sum=0.645 (3)

min=0.18, mean=0.254, max=0.32, sum=3.806 (15)

min=0.06, mean=0.254, max=0.51, sum=7.632 (30)

min=0.227, mean=0.229, max=0.233, sum=0.687 (3)

min=0.752, mean=0.752, max=0.752, sum=0.752 (1)

min=0.197, mean=0.203, max=0.213, sum=0.61 (3)

min=0.26, mean=0.406, max=0.63, sum=6.095 (15)

min=0.056, mean=0.288, max=0.579, sum=8.634 (30)

min=0.369, mean=0.372, max=0.374, sum=1.116 (3)

min=0.582, mean=0.582, max=0.582, sum=0.582 (1)

min=0.265, mean=0.269, max=0.275, sum=0.807 (3)

min=0.23, mean=0.452, max=0.79, sum=6.786 (15)

min=0.115, mean=0.348, max=0.562, sum=10.442 (30)

min=0.146, mean=0.156, max=0.164, sum=0.468 (3)

min=0.663, mean=0.663, max=0.663, sum=0.663 (1)

min=0.187, mean=0.199, max=0.213, sum=0.797 (4)

min=0.14, mean=0.249, max=0.3, sum=3.728 (15)

min=0.022, mean=0.168, max=0.42, sum=5.036 (30)

min=0.189, mean=0.193, max=0.195, sum=0.578 (3)

min=0.718, mean=0.718, max=0.718, sum=0.718 (1)

min=0.524, mean=0.524, max=0.524, sum=0.524 (1)

min=0.205, mean=0.216, max=0.225, sum=0.864 (4)

min=0.21, mean=0.276, max=0.351, sum=4.146 (15)

min=0.024, mean=0.207, max=0.432, sum=6.2 (30)

min=0.19, mean=0.194, max=0.196, sum=0.581 (3)

min=0.104, mean=0.133, max=0.15, sum=0.532 (4)

min=0.211, mean=0.29, max=0.4, sum=4.354 (15)

min=0.005, mean=0.118, max=0.323, sum=3.529 (30)

min=0.2, mean=0.204, max=0.21, sum=0.612 (3)

min=0.162, mean=0.193, max=0.232, sum=0.772 (4)

min=0.2, mean=0.291, max=0.39, sum=4.368 (15)

min=0.02, mean=0.168, max=0.438, sum=5.043 (30)

min=0.294, mean=0.297, max=0.301, sum=0.89 (3)

min=0.791, mean=0.791, max=0.791, sum=0.791 (1)

min=0.586, mean=0.586, max=0.586, sum=0.586 (1)

min=0.228, mean=0.25, max=0.269, sum=1.002 (4)

min=0.21, mean=0.318, max=0.48, sum=4.775 (15)

min=0.028, mean=0.22, max=0.521, sum=6.61 (30)

min=0.254, mean=0.258, max=0.262, sum=0.774 (3)

min=0.745, mean=0.745, max=0.745, sum=0.745 (1)

min=0.185, mean=0.201, max=0.22, sum=0.804 (4)

min=0.2, mean=0.276, max=0.37, sum=4.141 (15)

min=0.032, mean=0.202, max=0.438, sum=6.051 (30)

min=0.375, mean=0.384, max=0.389, sum=1.152 (3)

min=0.799, mean=0.799, max=0.799, sum=0.799 (1)

min=0.562, mean=0.562, max=0.562, sum=0.562 (1)

min=0.22, mean=0.251, max=0.275, sum=0.752 (3)

min=0.24, mean=0.469, max=0.78, sum=7.035 (15)

min=0.098, mean=0.337, max=0.646, sum=10.108 (30)

min=0.202, mean=0.21, max=0.225, sum=0.631 (3)

min=0.704, mean=0.704, max=0.704, sum=0.704 (1)

min=0.478, mean=0.478, max=0.478, sum=0.478 (1)

min=0.156, mean=0.167, max=0.173, sum=0.5 (3)

min=0.2, mean=0.242, max=0.35, sum=3.627 (15)

min=0.049, mean=0.236, max=0.479, sum=7.084 (30)

min=0.321, mean=0.329, max=0.338, sum=0.986 (3)
⚠ Brown et al. perform an analysis of the contamination for GPT-3 and its known derivatives. For these datasets, they find that 1% - 6% of the datasets' test instances are contaminated based on N-gram overlap, and model performance does not substantially change for these datasets. See Table C.1 on page 45 of https://arxiv.org/pdf/2005.14165.pdf.

min=0.775, mean=0.775, max=0.775, sum=0.775 (1)
⚠ See the Brown et al. contamination note above (Table C.1 of https://arxiv.org/pdf/2005.14165.pdf).

min=0.586, mean=0.586, max=0.586, sum=0.586 (1)
⚠ See the Brown et al. contamination note above (Table C.1 of https://arxiv.org/pdf/2005.14165.pdf).

min=0.182, mean=0.194, max=0.213, sum=0.581 (3)

min=0.26, mean=0.422, max=0.7, sum=6.336 (15)

min=0.081, mean=0.306, max=0.552, sum=9.184 (30)

min=0.194, mean=0.199, max=0.203, sum=0.596 (3)
⚠ See the Brown et al. contamination note above (Table C.1 of https://arxiv.org/pdf/2005.14165.pdf).

min=0.682, mean=0.682, max=0.682, sum=0.682 (1)
⚠ See the Brown et al. contamination note above (Table C.1 of https://arxiv.org/pdf/2005.14165.pdf).

min=0.502, mean=0.502, max=0.502, sum=0.502 (1)
⚠ See the Brown et al. contamination note above (Table C.1 of https://arxiv.org/pdf/2005.14165.pdf).

min=0.222, mean=0.232, max=0.251, sum=0.696 (3)

min=0.19, mean=0.243, max=0.29, sum=3.642 (15)

min=0.049, mean=0.236, max=0.505, sum=7.075 (30)

min=0.115, mean=0.119, max=0.123, sum=0.357 (3)
⚠ See the Brown et al. contamination note above (Table C.1 of https://arxiv.org/pdf/2005.14165.pdf).

min=0.555, mean=0.555, max=0.555, sum=0.555 (1)
⚠ See the Brown et al. contamination note above (Table C.1 of https://arxiv.org/pdf/2005.14165.pdf).

min=0.438, mean=0.438, max=0.438, sum=0.438 (1)
⚠ See the Brown et al. contamination note above (Table C.1 of https://arxiv.org/pdf/2005.14165.pdf).

min=0.174, mean=0.188, max=0.196, sum=0.563 (3)

min=0.17, mean=0.235, max=0.35, sum=3.518 (15)

min=0.025, mean=0.184, max=0.432, sum=5.515 (30)

min=0.081, mean=0.082, max=0.083, sum=0.247 (3)
⚠ See the Brown et al. contamination note above (Table C.1 of https://arxiv.org/pdf/2005.14165.pdf).

min=0.435, mean=0.435, max=0.435, sum=0.435 (1)
⚠ See the Brown et al. contamination note above (Table C.1 of https://arxiv.org/pdf/2005.14165.pdf).

min=0.38, mean=0.38, max=0.38, sum=0.38 (1)
⚠ See the Brown et al. contamination note above (Table C.1 of https://arxiv.org/pdf/2005.14165.pdf).

min=0.206, mean=0.215, max=0.222, sum=0.645 (3)

min=0.132, mean=0.243, max=0.32, sum=3.641 (15)

min=0.006, mean=0.124, max=0.417, sum=3.708 (30)

min=0.397, mean=0.406, max=0.413, sum=1.219 (3)
⚠ See the Brown et al. contamination note above (Table C.1 of https://arxiv.org/pdf/2005.14165.pdf).

min=0.822, mean=0.822, max=0.822, sum=0.822 (1)
⚠ See the Brown et al. contamination note above (Table C.1 of https://arxiv.org/pdf/2005.14165.pdf).

min=0.646, mean=0.646, max=0.646, sum=0.646 (1)
⚠ See the Brown et al. contamination note above (Table C.1 of https://arxiv.org/pdf/2005.14165.pdf).

min=0.558, mean=0.593, max=0.615, sum=1.78 (3)

min=0.28, mean=0.569, max=0.86, sum=8.532 (15)

min=0.16, mean=0.373, max=0.625, sum=11.201 (30)

min=0.367, mean=0.383, max=0.394, sum=1.149 (3)
⚠ See the Brown et al. contamination note above (Table C.1 of https://arxiv.org/pdf/2005.14165.pdf).

min=0.815, mean=0.815, max=0.815, sum=0.815 (1)
⚠ See the Brown et al. contamination note above (Table C.1 of https://arxiv.org/pdf/2005.14165.pdf).

min=0.594, mean=0.594, max=0.594, sum=0.594 (1)
⚠ See the Brown et al. contamination note above (Table C.1 of https://arxiv.org/pdf/2005.14165.pdf).

min=0.596, mean=0.61, max=0.63, sum=1.829 (3)

min=0.26, mean=0.568, max=0.86, sum=8.515 (15)

min=0.138, mean=0.392, max=0.656, sum=11.774 (30)

min=0.167, mean=0.175, max=0.179, sum=0.525 (3)
⚠ See the Brown et al. contamination note above (Table C.1 of https://arxiv.org/pdf/2005.14165.pdf).