HumanEval is an evaluation harness for the HumanEval problem-solving dataset described in the paper: Chen, Mark, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, et al. "Evaluating Large Language Models Trained on Code." arXiv, 2021. https://arxiv.org/abs/2107.03374. The code is available on GitHub.
You can see it used to evaluate new models, for example in "Beating GPT-4 on HumanEval with a Fine-Tuned CodeLlama-34B" by Phind:
We have fine-tuned CodeLlama-34B and CodeLlama-34B-Python on an internal Phind dataset, achieving 67.6% and 69.5% pass@1 on HumanEval, respectively. GPT-4 achieved 67% according to its official technical report in March. To ensure result validity, we applied OpenAI's decontamination methodology to our dataset.
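The pass@1 figures above use the pass@k metric defined in the HumanEval paper: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k randomly drawn samples is correct. A minimal sketch of that unbiased estimator, computed as a product for numerical stability:

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: total samples generated per problem
    c: number of samples that passed the unit tests
    k: budget of samples drawn per problem
    Returns 1 - C(n-c, k) / C(n, k), the probability that at
    least one of k drawn samples is correct.
    """
    if n - c < k:
        # Fewer failures than draws: at least one success is guaranteed.
        return 1.0
    prob_all_fail = 1.0
    for i in range(n - c + 1, n + 1):
        prob_all_fail *= 1.0 - k / i
    return 1.0 - prob_all_fail


# Example: 3 samples per problem, 1 correct, budget of 1 draw.
print(pass_at_k(3, 1, 1))  # 1/3 chance a single draw is the correct sample
```

The reported pass@1 is this estimate averaged over all 164 HumanEval problems; sampling n > k completions per problem and applying the estimator gives lower variance than literally drawing k samples.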