> It was a very simple question about something very well documented (Oban timeouts).

It's some 3rd party thing for Elixir, a niche within a niche. I wouldn't expect an LLM to do well there.
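For reference, the documented way to cap an Oban job's runtime is a one-line `timeout/1` callback on the worker. A minimal sketch (the module name, queue, and 30-second value are made up; `timeout/1` and `perform/1` are the real Oban.Worker callbacks):

```elixir
defmodule MyApp.SomeWorker do
  use Oban.Worker, queue: :default

  # Cap this job's runtime in milliseconds; Oban's default is :infinity.
  @impl Oban.Worker
  def timeout(_job), do: :timer.seconds(30)

  @impl Oban.Worker
  def perform(%Oban.Job{args: args}) do
    # Placeholder for the actual work.
    IO.inspect(args, label: "processing")
    :ok
  end
end
```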

> I doubt LLM benchmarks more and more, what are they even testing?

Probably testing by asking it to solve a problem with python or (java|type)script. Perhaps not even specifying a language and watching it generate a generic React application.


user34283
Expectations vary wildly.

Sometimes people expect to use LLMs to unearth hard to find information.

In reality, LLMs seem to quickly fall apart when you go from ubiquitous libraries with 200k stars on GitHub to one with "just" 1k stars.

What makes the situation worse is the way LLMs fail. Hallucinations where it goes "my usage example did not work because you are on the wrong version of the library" or "you are using the wrong SDK", etc., are super common in this scenario. This leads to further time wasted trying to apply reasonably plausible fixes that are entirely hallucinated.

simonw
If a library isn't widely used (and is small enough), you can paste the entire thing into the context to ensure the LLM can use it effectively.
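One way to do that, sketched in Elixir (the dependency path and output filename are made up for illustration), is to glob the library's source files and join them into a single blob you can drop into the prompt:

```elixir
# Rough sketch: concatenate a small dependency's source files into one
# text file that can be pasted into the model's context window.
"deps/small_lib/lib/**/*.ex"
|> Path.wildcard()
|> Enum.map(fn path -> "## #{path}\n\n" <> File.read!(path) end)
|> Enum.join("\n\n")
|> then(&File.write!("small_lib_context.txt", &1))
```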
techpression
Something that is well documented should still perform well; there are few places to go wrong compared with something like React, where the training data seems to be a cesspool of the worst code imaginable. At least that’s my experience using it for React.
doix OP
Sure, I'm just answering your question about what people are benchmarking, and it's not Elixir. You could be the person who benchmarks LLMs in niche languages and shows how bad they are at them.

If your benchmark suite became popular enough and folks referenced it, the people training the LLMs would most likely try to make their models better at those languages.
