
Have you thought about benchmarking models a month or two after release to see how they compare to the day-1 release?

simonw
For that to be useful I'd need to be running much better benchmarks: anything less than a few hundred numerically scored tasks would be unlikely to reliably identify differences.

An organization like Artificial Analysis would be a better fit for that kind of investigation: https://artificialanalysis.ai/
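
To make that concrete, here's a rough back-of-the-envelope simulation (with entirely made-up numbers) of how often a paired comparison would spot a small quality drop at different benchmark sizes. With a few dozen tasks it's mostly noise; only somewhere in the hundreds does detection become likely:

    # Made-up numbers: how many scored tasks before a small real drop is detectable?
    import numpy as np

    rng = np.random.default_rng(0)
    true_gap = 0.03   # hypothetical 3-point drop on a 0..1 score scale
    noise = 0.15      # assumed per-task score noise

    def detection_rate(n_tasks, trials=2000):
        """Fraction of simulated reruns where a paired comparison flags the
        drop at roughly the 95% confidence level."""
        detected = 0
        for _ in range(trials):
            day1 = rng.normal(0.70, noise, n_tasks)                    # day-1 scores
            later = day1 - true_gap + rng.normal(0, noise, n_tasks)    # rerun scores
            diff = day1 - later          # same tasks, so task difficulty cancels
            se = diff.std(ddof=1) / np.sqrt(n_tasks)
            if diff.mean() / se > 1.96:
                detected += 1
        return detected / trials

    for n in (20, 50, 200, 500):
        print(f"{n:>4} tasks -> {detection_rate(n):.0%} chance of spotting the drop")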

westurner
Manually:

From https://www.hackerneue.com/item?id=40859434 :

> E.g. promptfoo and chainforge have multi-LLM workflows.

> Promptfoo has a YAML configuration for prompts, providers, and tests: https://www.promptfoo.dev/docs/configuration/guide/
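
A minimal sketch of such a config, written from Python so it can be generated programmatically (key names follow my reading of the promptfoo configuration guide; the provider IDs and assertion are placeholders, so double-check against the docs):

    # Sketch of a promptfoo-style config generated from Python.
    import yaml  # pip install pyyaml

    config = {
        "prompts": ["Summarize in one sentence: {{article}}"],
        "providers": [
            "openai:gpt-4o-mini",          # placeholder provider IDs; check the
            "anthropic:claude-3-5-haiku",  # promptfoo providers docs for exact strings
        ],
        "tests": [
            {
                "vars": {"article": "LLMs are language models trained on large corpora."},
                "assert": [{"type": "contains", "value": "language"}],
            }
        ],
    }

    with open("promptfooconfig.yaml", "w") as f:
        yaml.safe_dump(config, f, sort_keys=False)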

openai/evals//docs/build-eval.md: https://github.com/openai/evals/blob/main/docs/build-eval.md

From https://www.hackerneue.com/item?id=45267271 :

> API facades like OpenLLM and model routers like OpenRouter have standard interfaces for many or most LLM inputs and outputs. Tools like Promptfoo, ChainForge, and LocalAI also all have abstractions over many models.

> What are the open standards for representing LLM inputs, and outputs?

> W3C PROV has prov:Entity, prov:Activity, and prov:Agent for modeling AI provenance: who or what did what when.

> LLM evals could be represented with the W3C EARL (Evaluation and Reporting Language).
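
A minimal sketch of how one eval result might be modeled with both vocabularies using rdflib (the exact graph shape here is my assumption about combining them, not an established profile):

    # One LLM eval result modeled with W3C PROV + EARL via rdflib.
    from rdflib import Graph, Namespace, Literal
    from rdflib.namespace import RDF, XSD

    PROV = Namespace("http://www.w3.org/ns/prov#")
    EARL = Namespace("http://www.w3.org/ns/earl#")
    EX = Namespace("http://example.org/evals/")   # placeholder namespace

    g = Graph()
    g.bind("prov", PROV); g.bind("earl", EARL); g.bind("ex", EX)

    model = EX["model/some-llm-2025-01"]   # prov:Agent: who or what acted
    run = EX["run/42"]                     # prov:Activity: the eval run
    output = EX["output/42"]               # prov:Entity: the generated answer
    test = EX["test/summarize-001"]        # the eval case itself

    g.add((model, RDF.type, PROV.Agent))
    g.add((run, RDF.type, PROV.Activity))
    g.add((output, RDF.type, PROV.Entity))
    g.add((run, PROV.wasAssociatedWith, model))   # who did what
    g.add((output, PROV.wasGeneratedBy, run))     # what was produced
    g.add((run, PROV.startedAtTime,
           Literal("2025-01-01T00:00:00Z", datatype=XSD.dateTime)))  # when

    assertion = EX["assertion/42"]
    result = EX["result/42"]
    g.add((assertion, RDF.type, EARL.Assertion))
    g.add((assertion, EARL.subject, output))
    g.add((assertion, EARL.test, test))
    g.add((assertion, EARL.assertedBy, EX["grader/llm-judge"]))
    g.add((assertion, EARL.result, result))
    g.add((result, RDF.type, EARL.TestResult))
    g.add((result, EARL.outcome, EARL.passed))

    print(g.serialize(format="turtle"))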

"Can Large Language Models Emulate Judicial Decision-Making? [Paper]" https://www.hackerneue.com/item?id=42927611

"California governor signs AI transparency bill into law" (2025) https://www.hackerneue.com/item?id=45418428 :

> https://sb53.info/

Is this the first of its kind?

> CalCompute
