Expectations vary wildly.
Sometimes people expect to use LLMs to unearth hard-to-find information.
In reality, LLMs seem to fall apart quickly when you go from a ubiquitous library with 200k stars on GitHub to one with "just" 1k stars.
What makes the situation worse is the way LLMs fail. Hallucinations where the model goes "my usage example did not work because you are on the wrong version of the library" or "you are using the wrong SDK" are super common in this scenario. This leads to further time wasted applying plausible-sounding fixes that are entirely hallucinated.
Something that is well documented should still perform well; there are few places to go wrong, compared with something like React, where the training data seems to be a cesspool of the worst code imaginable. At least that's been my experience using LLMs for React.
Sure, I'm just answering your question of what people are benchmarking, and it's not Elixir. You could be the person who benchmarks LLMs in niche languages and shows how bad they are at it.
If your benchmark suite became popular enough and folks referenced it, the people training the LLMs would most likely try to make their models better at those languages.
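Concretely, the harness wouldn't need to be fancy. Here's a rough Python sketch of the idea; call_llm is a hypothetical stand-in for whatever model client you'd use, and the Elixir task and check are invented examples, not items from any real suite:

    import os
    import subprocess
    import tempfile

    def call_llm(prompt: str) -> str:
        """Hypothetical stub: send the prompt to a model, return its code answer."""
        raise NotImplementedError("plug in your model client here")

    # Each task pairs a prompt with an Elixir snippet that prints `true` on success.
    TASKS = [
        {
            "prompt": "Write an Elixir module named Solution with a function "
                      "reverse/1 that reverses a list without Enum.reverse/1.",
            "check": "IO.inspect(Solution.reverse([1, 2, 3]) == [3, 2, 1])",
        },
    ]

    def run_elixir(source: str) -> str:
        """Write the snippet to a temp .exs file, run it with the elixir CLI."""
        with tempfile.NamedTemporaryFile("w", suffix=".exs", delete=False) as f:
            f.write(source)
            path = f.name
        try:
            result = subprocess.run(
                ["elixir", path], capture_output=True, text=True, timeout=30
            )
            return result.stdout
        finally:
            os.unlink(path)

    def score(tasks) -> float:
        """Fraction of tasks whose generated code passes its check (pass@1)."""
        passed = 0
        for task in tasks:
            code = call_llm(task["prompt"])
            passed += "true" in run_elixir(code + "\n" + task["check"])
        return passed / len(tasks)

The crude "true"-in-stdout check is just to keep the sketch short; a real suite would want sandboxing, multiple samples per task, and stricter result parsing.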
It's some third-party thing for Elixir, a niche within a niche. I wouldn't expect an LLM to do well there.
> I doubt LLM benchmarks more and more, what are they even testing?
Probably testing by asking the model to solve a problem in Python or (Java|Type)Script. Perhaps not even specifying a language and just watching it generate a generic React application.
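For reference, the generic-Python end of that looks like a HumanEval-style task: the model gets a function stub plus docstring, and the harness runs hidden assertions against its completion. This one is invented for illustration, not pulled from an actual benchmark:

    # Illustrative HumanEval-style task (invented, not from a real suite).
    # The model is given the stub and docstring and must fill in the body;
    # the harness then runs hidden assertions against the completion.

    def running_max(nums: list[int]) -> list[int]:
        """Return a list where element i is the max of nums[:i + 1]."""
        out, cur = [], float("-inf")  # a correct completion a model might produce
        for n in nums:
            cur = max(cur, n)
            out.append(cur)
        return out

    assert running_max([1, 3, 2, 5]) == [1, 3, 3, 5]  # hidden test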