
The idea that LLMs were trained on miscellaneous, low-quality scraped code may have been true a year ago, but I suspect it is no longer true today.

All of the major model vendors are competing on how well their models can code. The key to getting better code out of the model is improving the quality of the code that it is trained on.

Filtering training data for high-quality code is easier than filtering for high-quality data of other types.
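
Part of why I suspect this: unlike prose, code comes with cheap, machine-checkable quality signals, such as whether it parses, whether it lints cleanly, and whether its tests pass. As a toy illustration (my own sketch, not a description of any vendor's actual pipeline), even a filter this simple discards a lot of junk; the signals and the threshold here are invented for the example:

```python
import ast

def looks_like_quality_python(source: str) -> bool:
    """Toy filter: keep a Python sample only if it parses and at least
    half of its functions carry docstrings. Both signals and the 0.5
    threshold are illustrative, not anyone's real pipeline."""
    try:
        tree = ast.parse(source)  # syntactic validity is the cheapest signal
    except SyntaxError:
        return False
    funcs = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    if not funcs:
        return True  # nothing to judge here; leave it to other filters
    documented = sum(1 for f in funcs if ast.get_docstring(f))
    return documented / len(funcs) >= 0.5
```

The same idea scales up: you could run linters, execute test suites, or weight samples by repository activity, and none of those checks have an obvious equivalent for arbitrary web text.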

My strong hunch is that the quality of code being used to train current frontier models is way higher than it was a year ago.

