HumanEval (Code)
The HumanEval benchmark measures the functional correctness of programs synthesized from docstrings (Chen et al., 2021).
- Task: code generation from docstrings
- What: n/a
- When: n/a
- Who: n/a
- Language: synthetic
Metrics:
- pass@1
- Denoised inference time (s)
- # eval
- # train
- truncated
- # prompt tokens
- # output tokens
- # trials
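The pass@1 metric above is the k=1 case of the pass@k metric defined by Chen et al. (2021): the probability that at least one of k sampled completions passes the problem's unit tests. A minimal sketch of their unbiased estimator, assuming n samples are drawn per problem and c of them pass:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: total completions sampled for a problem
    c: number of those completions that pass the unit tests
    k: number of completions considered
    """
    if n - c < k:
        # Too few failures to fill a k-sample draw with only failures.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a product for numerical stability
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# With k=1 this reduces to the plain pass rate c/n,
# which is what the pass@1 column reports.
```

With a single trial per problem (n = 1, k = 1), the estimator is simply 1 if the sample passes and 0 otherwise, and the benchmark score is the mean over all eval instances.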