techpression
It took me one question to get it to spit out a completely dreamt-up codebase, complete with emojis and promises of fixing all my problems, and of course none of it worked. It was a very simple question about something very well documented (Oban timeouts).
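
For context, the documented way to do this is a one-line `timeout/1` callback on the worker; a minimal sketch (the module name is made up):

    defmodule MyApp.SomeWorker do
      use Oban.Worker, queue: :default

      # Documented Oban.Worker callback: per-job timeout in milliseconds.
      @impl Oban.Worker
      def timeout(%Oban.Job{}), do: :timer.seconds(30)

      @impl Oban.Worker
      def perform(%Oban.Job{args: args}) do
        # ... actual work goes here ...
        {:ok, args}
      end
    end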

I doubt LLM benchmarks more and more: what are they even testing?


nakamoto_damacy
> what are they even testing?

How well the LLM does on the benchmarks. Obviously.

:P

techpression OP
Is there some kind of conversion ratio to actual value? ;)

ileonichwiesz
Sure there is. It’s called “higher numbers = more investor money”. Any improvement in actual utility is purely coincidental.

> It was a very simple question about something very well documented (Oban timeouts).

It's some 3rd party thing for Elixir, a niche within a niche. I wouldn't expect an LLM to do well there.

> I doubt LLM benchmarks more and more, what are they even testing?

Probably testing by asking it to solve a problem with Python or (java|type)script. Perhaps not even specifying a language and watching it generate a generic React application.

user34283
Expectations vary wildly.

Sometimes people expect to use LLMs to unearth hard-to-find information.

In reality, LLMs seem to fall apart quickly when you go from a ubiquitous library with 200k stars on GitHub to one with "just" 1k stars.

What makes the situation worse is the way LLMs fail. Hallucinations where it goes "my usage example did not work because you are on the wrong version of the library/using the wrong SDK" etc. are super common in this scenario. This leads to further time wasted trying to apply reasonably plausible fixes that are entirely hallucinated.

simonw
If a library isn't widely used (and is small enough) you can paste the entire thing into the context to ensure the LLM can use it effectively.
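
As a throwaway sketch of what that can look like in Elixir (the PromptPack module is made up, not a real library):

    # Concatenate a small library's source so the model sees the real API
    # instead of guessing it. Purely illustrative.
    defmodule PromptPack do
      def pack(dir) do
        Path.wildcard(Path.join(dir, "**/*.{ex,exs,md}"))
        |> Enum.map_join("\n\n", fn path ->
          "==== #{path} ====\n" <> File.read!(path)
        end)
      end
    end

    # e.g. File.write!("prompt.txt", PromptPack.pack("deps/oban"))
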
techpression OP
Something that is well documented should still perform well; there are few places to go wrong. Compare that with something like React, where the training data seems to be a cesspool of the worst code imaginable, at least that's my experience using it for React.

Sure, I'm just answering your question about what people are benchmarking, and it's not Elixir. You could be the person who benchmarks LLMs in niche languages and shows how bad they are at them.

If your benchmark suite became popular enough and folks referenced it, the people training the LLMs would most likely try to make the model better at those languages.
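
A hypothetical shape for one such case (NicheBench and the naive substring check are placeholders, not a real suite):

    defmodule NicheBench do
      defstruct [:id, :prompt, :check]

      # ask_model is any fn that maps a prompt string to the model's answer.
      def score(%NicheBench{prompt: prompt, check: check}, ask_model) do
        if check.(ask_model.(prompt)), do: 1, else: 0
      end
    end

    bench = %NicheBench{
      id: "oban-timeout",
      prompt: "Add a 30 second timeout to an Oban worker.",
      check: fn answer -> String.contains?(answer, "def timeout") end
    }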
