How well the LLM does on the benchmarks. Obviously. :P
It's some third-party library for Elixir, a niche within a niche. I wouldn't expect an LLM to do well there.
> I doubt LLM benchmarks more and more, what are they even testing?
Probably by asking it to solve a problem in Python or (java|type)script. Perhaps not even specifying a language, and watching it generate a generic React application.
Sometimes people expect to use LLMs to unearth hard-to-find information.
In reality, LLMs seem to quickly fall apart when you go from ubiquitous libraries with 200k stars on GitHub to one with "just" 1k stars.
What makes the situation worse is the way LLMs fail. Hallucinations where the model insists "my usage example did not work because you are on the wrong version of the library" or "you are using the wrong SDK" are super common in this scenario. That leads to further time wasted applying plausible-sounding fixes that are entirely hallucinated.
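For illustration, a minimal Elixir sketch of that failure mode. NicheHTTP and fetch!/2 are entirely hypothetical, standing in for a small library's real API and the plausible-looking function an LLM invents for it:

    defmodule NicheHTTP do
      # Hypothetical stand-in for a small "1k stars" library;
      # its entire public API is get/1.
      def get(url), do: {:ok, "response from #{url}"}
    end

    # A typical LLM suggestion: a function the library never had.
    NicheHTTP.fetch!("https://example.com", retries: 3)
    # ** (UndefinedFunctionError) function NicheHTTP.fetch!/2 is undefined or private
    # The model's follow-up "fix" is usually another invention,
    # e.g. "upgrade to v2.x", even when no such version exists.

The error itself is at least honest; the time sink is the confident, fabricated explanation that comes after it.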
If your benchmark suite became popular enough and folks referenced it, the people training LLMs would most likely try to make their models better at those languages.
I doubt LLM benchmarks more and more, what are they even testing?