Comment by enragedcacti

enragedcacti 3 days ago parent

What is interesting about reducing the problem to counting? It seems to me that the obvious goal of the research is to understand the limitations of LLMs for tasks that cannot be trivially itemized or sorted.

emporas 3 days ago

The more specific the instructions, the better they perform. There is a huge difference between trying to find omitted text, omitted words, or omitted sentences.

If omitted words are to be found, put each word on its own line and number it. Do the same with sentences.

If you are trying to find omitted words and sentences, make one pass with only words, and another one with only sentences. Then combine the results.
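A minimal sketch of that preprocessing, assuming naive whitespace and punctuation splitting (real text would need a proper tokenizer). The helper names and the prompt wording are hypothetical, and sending each prompt to a model and merging the two answers is left out:

```python
import re


def number_items(items):
    """Put each item on its own line, prefixed with its 1-based index."""
    return "\n".join(f"{i}. {item}" for i, item in enumerate(items, start=1))


def split_words(text):
    # Assumption: words are whitespace-separated; punctuation stays attached.
    return text.split()


def split_sentences(text):
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def build_passes(text):
    """Build one prompt per pass: words only, then sentences only."""
    return {
        "words": "Find the omitted words in this numbered list:\n"
        + number_items(split_words(text)),
        "sentences": "Find the omitted sentences in this numbered list:\n"
        + number_items(split_sentences(text)),
    }


if __name__ == "__main__":
    passes = build_passes("The cat sat. It purred!")
    print(passes["words"])
    print(passes["sentences"])
```

Each pass gets its own prompt, so the model only ever sees one kind of unit at a time.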

enragedcacti OP 3 days ago

To what end? You have to segment and order the document (i.e. solve the problem) just to craft your prompt, so the LLM spitting the solution back to you is useless. The experiment uses these tasks because test cases can be algorithmically generated and scored, but it's not very interesting that one can structure the input to solve this specific, useless task with LLMs. It is interesting, though, that this limitation could carry over into tasks where traditional algorithms fail. LLMs improving at this would be legitimately useful, which is why a benchmark makes sense, but cheating the benchmark by augmenting the input doesn't.

emporas 2 days ago

> You have to segment and order the document (i.e. solve the problem)

Well, let's say that if this benchmark targets AGI, then no help should be given, no segmentation or structuring of information in any way, and it should be able to figure it out by itself.

If this benchmark targets LLMs trained on internet data (statistical engines, that is, not AGI), then these engines prefer structured information when solving a problem.

Segmenting the problem into smaller parts, usually with numbers, though dashes are acceptable as well, is what they have seen countless times in textbook examples. When the input doesn't match anything they have seen before, their performance easily degrades from superhuman to utter confusion. Superhuman for small problems, anyway.

This problem of omitted information is interesting to me: many times I want to interpolate some paragraphs into stories I write, to fill in some plot holes. I used the word "interpolate" with unstructured text, and the results were underwhelming, pretty bad most of the time. From now on, I will number each paragraph and ask it to find the omitted text.
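Numbering the paragraphs could look roughly like this. It is only a sketch, assuming paragraphs are separated by blank lines; the function name and the `[n]` label format are made up for illustration:

```python
import re


def number_paragraphs(story):
    """Prefix each blank-line-separated paragraph with a [n] label."""
    # Assumption: one or more blank lines mark a paragraph boundary.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", story.strip()) if p.strip()]
    return "\n\n".join(f"[{i}] {p}" for i, p in enumerate(paragraphs, start=1))


if __name__ == "__main__":
    story = "She left the house.\n\nShe arrived in Paris."
    print(number_paragraphs(story))
```

The labeled text can then be pasted into the prompt, with a request to say between which numbers a paragraph seems to be missing.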
