westurner:
OpenAI/evals > Building an eval: https://github.com/openai/evals/blob/main/docs/build-eval.md
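Per the build-eval doc, a basic match eval is driven by a JSONL file of samples, each with a chat-style `input` and an `ideal` answer. A minimal sketch of generating such a file (file name and prompts are illustrative, not from the doc):

```python
import json

# Sketch of the samples JSONL format for an openai/evals basic match eval
# (per docs/build-eval.md): one JSON object per line, with a chat "input"
# and an "ideal" completion. The prompt content here is a made-up example.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with a single word."},
            {"role": "user", "content": "What is the capital of France?"},
        ],
        "ideal": "Paris",
    },
]

with open("samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```

The resulting file is then referenced from a registry YAML entry so the eval framework can load and score it.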

"Robustness of Model-Graded Evaluations and Automated Interpretability" (2023) https://www.lesswrong.com/posts/ZbjyCuqpwCMMND4fv/robustness... :

> The results inspire future work and should caution against unqualified trust in evaluations and automated interpretability.

From https://www.hackerneue.com/item?id=37451534 : additional benchmarks: TheoremQA, LegalBench

