techpression
It took me one question to get it to spit out a completely dreamt-up codebase, complete with emojis and promises of fixing all my problems, and of course none of it worked. It was a very simple question about something very well documented (Oban timeouts).
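
For context, the documented way to do this is a one-line `timeout/1` callback on the worker; a minimal sketch (the module name is made up):

    defmodule MyApp.SomeWorker do
      use Oban.Worker, queue: :default

      # Documented Oban.Worker callback: per-job timeout in milliseconds.
      @impl Oban.Worker
      def timeout(%Oban.Job{}), do: :timer.seconds(30)

      @impl Oban.Worker
      def perform(%Oban.Job{args: args}) do
        # ... actual work goes here ...
        {:ok, args}
      end
    end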

I doubt LLM benchmarks more and more: what are they even testing?


nakamoto_damacy
> what are they even testing?

How well the LLM does on the benchmarks. Obviously.

:P

techpression OP
Is there some kind of conversion ratio to actual value? ;)

ileonichwiesz
Sure there is. It’s called “higher numbers = more investor money”. Any improvement in actual utility is purely coincidental.

> It was a very simple question about something very well documented (Oban timeouts).

It's some 3rd party thing for Elixir, a niche within a niche. I wouldn't expect an LLM to do well there.

> I doubt LLM benchmarks more and more, what are they even testing?

Probably testing by asking it to solve a problem with Python or (java|type)script. Perhaps not even specifying a language and watching it generate a generic React application.

user34283
Expectations vary wildly.

Sometimes people expect to use LLMs to unearth hard-to-find information.

In reality, LLMs seem to fall apart quickly when you go from a ubiquitous library with 200k stars on GitHub to one with "just" 1k stars.

What makes the situation worse is the way LLMs fail. Hallucinations where it goes "my usage example did not work because you are on the wrong version of the library/using the wrong SDK" etc. are super common in this scenario. This leads to further time wasted trying to apply reasonably plausible fixes that are entirely hallucinated.

simonw
If a library isn't widely used (and is small enough) you can paste the entire thing into the context to ensure the LLM can use it effectively.
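
As a throwaway sketch of what that can look like in Elixir (the PromptPack module is made up, not a real library):

    # Concatenate a small library's source so the model sees the real API
    # instead of guessing it. Purely illustrative.
    defmodule PromptPack do
      def pack(dir) do
        Path.wildcard(Path.join(dir, "**/*.{ex,exs,md}"))
        |> Enum.map_join("\n\n", fn path ->
          "==== #{path} ====\n" <> File.read!(path)
        end)
      end
    end

    # e.g. File.write!("prompt.txt", PromptPack.pack("deps/oban"))
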
techpression OP
Something that is well documented should still perform well; there are few places to go wrong. Compare that with something like React, where the training data seems to be a cesspool of the worst code imaginable, at least that's my experience using it for React.

Sure, I'm just answering your question about what people are benchmarking, and it's not Elixir. You could be the person who benchmarks LLMs in niche languages and shows how bad they are at them.

If your benchmark suite became popular enough and folks referenced it, the people training the LLMs would most likely try to make the model better at those languages.
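
A hypothetical shape for one such case (NicheBench and the naive substring check are placeholders, not a real suite):

    defmodule NicheBench do
      defstruct [:id, :prompt, :check]

      # ask_model is any fn that maps a prompt string to the model's answer.
      def score(%NicheBench{prompt: prompt, check: check}, ask_model) do
        if check.(ask_model.(prompt)), do: 1, else: 0
      end
    end

    bench = %NicheBench{
      id: "oban-timeout",
      prompt: "Add a 30 second timeout to an Oban worker.",
      check: fn answer -> String.contains?(answer, "def timeout") end
    }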
