
Poor benchmark.

I tried their prompt [1] with 3 numbered items, and qwq-32b got it right with no problems at all. I think it could solve 100 numbered items correctly 100% of the time, but it would probably need a million tokens, maybe even 10 million.

The limit of 5,000 tokens is peanuts for a reasoning model. Give it plenty of test-time compute; even 10x 5,000 tokens is still too little.

The authors talk about long inputs, so if the input is 100 pages, give it a billion tokens.

The correct way to implement this is in batches: have the model find the first 5 omitted numbered items, and if it finds those, simplify both the input items and the remaining omitted items and go again.

Depending on the size of the input, it will always need a hefty amount of tokens, but simplification will help it backtrack correctly and not lose the thread entirely.
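
A minimal sketch of that loop in Python. The ask_model(prompt) helper is hypothetical, a stand-in for whatever completion API you call, and the sketch assumes the model echoes missing lines back verbatim:

    BATCH = 5

    def find_missing(original, recited, ask_model, batch=BATCH):
        # Find lines of `original` absent from `recited`, at most `batch`
        # per pass, shrinking the problem after every confirmed find.
        missing = []
        remaining = list(original)
        while True:
            numbered = "\n".join(f"{i + 1}) {line}"
                                 for i, line in enumerate(remaining))
            prompt = (
                "Here is the complete original text, one numbered item per line:\n"
                + numbered
                + "\n\nHere is a recitation that may be missing some lines:\n"
                + "\n".join(recited)
                + f"\n\nList only the first {batch} missing lines, nothing else."
            )
            reply = ask_model(prompt)
            candidates = [l.strip() for l in reply.splitlines() if l.strip()]
            # Keep only answers we can verify: still present in the original
            # and genuinely absent from the recitation.
            found = [c for c in candidates[:batch]
                     if c in remaining and c not in recited]
            if not found:
                return missing
            missing.extend(found)
            # Simplify: drop confirmed finds so the next pass is smaller.
            remaining = [l for l in remaining if l not in found]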

[1] System prompt: You are helping a student practice memorizing poems. The student will recite a poem, but they may have missed some lines. Your task is to identify exactly which lines are missing from their recitation. List only the missing lines, nothing else.

User message: Here is the complete original poem:

1) Quisella's lashes fluttered panic-morse.
2) The Moisture Vampires leeches that sucked humidity.
3) Lysandra's nostrils flared precisely one degree.

Now, here is my recitation which may be missing some lines:

Quisella's lashes fluttered panic-morse.
Lysandra's nostrils flared precisely one degree.

What lines did I miss? Please list only the missing lines, nothing else.


enragedcacti
What is interesting about reducing the problem to counting? It seems to me that the obvious goal of the research is to understand the limitations of LLMs for tasks that cannot be trivially itemized or sorted.
emporas OP
The more specific the instructions, the better they perform. There is a huge difference between trying to find omitted text, omitted words, or omitted sentences.

If omitted words are to be found, put each word on its own line and number it. The same with sentences.

If you are trying to find omitted words and sentences, make one pass with only words, and another one with only sentences. Then combine the results.
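
A sketch of that two-pass approach, again with the hypothetical ask_model(prompt) helper and a naive sentence splitter:

    import re

    def number_items(items):
        # One numbered item per line.
        return "\n".join(f"{i + 1}) {item}" for i, item in enumerate(items))

    def find_omitted(full_text, partial_text, ask_model):
        # Pass 1: words only.
        word_prompt = ("Original text, one word per numbered line:\n"
                       + number_items(full_text.split())
                       + "\n\nPartial text:\n" + partial_text
                       + "\n\nList only the omitted words, nothing else.")
        omitted_words = ask_model(word_prompt).splitlines()

        # Pass 2: sentences only, split naively on .!? boundaries.
        sentences = re.split(r"(?<=[.!?])\s+", full_text)
        sentence_prompt = ("Original text, one sentence per numbered line:\n"
                           + number_items(sentences)
                           + "\n\nPartial text:\n" + partial_text
                           + "\n\nList only the omitted sentences, nothing else.")
        omitted_sentences = ask_model(sentence_prompt).splitlines()

        # Combine the results of the two passes.
        return omitted_words, omitted_sentences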

enragedcacti
To what end? You have to segment and order the document (i.e. solve the problem) just to craft your prompt, so having the LLM spit the solution back at you is useless. The experiment uses these tasks because test cases can be algorithmically generated and scored, but it's not very interesting that one can structure the input to solve this specific, useless task with LLMs. It is interesting, though, that this limitation could carry over into tasks where traditional algorithms fail. LLMs improving at this would be legitimately useful, which is why a benchmark makes sense, but cheating the benchmarks by augmenting the input doesn't.
emporas OP
> You have to segment and order the document (i.e. solve the problem)

Well, let's say that if this benchmark targets AGI, then no help should be given, no segmentation or structuring of information in any way, and it should be able to figure it out by itself.

If this benchmark targets LLMs trained on internet data, that is, statistical engines rather than AGI, then these engines have a preference for structured information when solving a problem.

Segmenting the problem into smaller parts, usually with numbers, though dashes are acceptable as well, is what they have seen countless times in textbook examples. When the input doesn't match prior input they have seen, their performance easily degrades from superhuman to utter confusion. Superhuman for small problems, anyway.

This problem of omitted information is interesting to me: many times I want to interpolate some paragraphs into stories I write, to fill in plot holes. I used the word "interpolate" on unstructured text, and the results were underwhelming, pretty bad most of the time. From now on, I will number each paragraph and ask it to find omitted text there.
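
The numbering step is trivial to script; the prompt here is only a guess at what I would try, with the same hypothetical ask_model helper as above:

    def number_paragraphs(story):
        # Blank-line-separated paragraphs, each given its own number.
        paragraphs = [p.strip() for p in story.split("\n\n") if p.strip()]
        return "\n".join(f"{i + 1}) {p}" for i, p in enumerate(paragraphs))

    def fill_plot_holes(story, ask_model):
        prompt = ("Here is a story, one numbered paragraph at a time:\n"
                  + number_paragraphs(story)
                  + "\n\nSome paragraphs may be omitted between the numbered"
                  + " ones. For each gap, write the missing paragraph and say"
                  + " which two numbers it falls between, nothing else.")
        return ask_model(prompt)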

emporas OP
I just tried qwq-32b on the numbered headlines of the HN front page right now, 26 items [1]. I removed 3 headlines, and it still found all 3 omitted items on the first try, perfect, and it didn't even consume 50,000 tokens.

[1] https://gist.github.com/pramatias/fee1391ad08c7b965f435f3af1...
