Comment by yousif_123123

yousif_123123 5 days ago

This is very interesting.

1. The authors mention that the attention mechanism may be unable to attend to the location of gaps, since the gaps aren't tokens. But I would have expected a good transformer LLM to get at least somewhat close to the gap location. I don't see why the architecture is mathematically less suited to this; it could attend to a region that may contain gaps. I wonder if fine-tuning on a task like this could help?

2. Shorter inputs with fewer omissions were harder to solve. That is not completely surprising: for a human doing this task, a single missing word would be harder to notice, and similarly 1 missing line would be harder to spot than 10. But it is still interesting for an LLM to have this problem.

3. Reasoning models do better, as they can write out the documents and potentially solve this easily. It is still very surprising that this doesn't lead to 100% accuracy. This should be a trivial task. Like the paper says, a trivial program can be written to solve this. Perhaps ChatGPT (or a similar agent) could encounter this paper during training and learn to write and run Python when faced with a problem like this.
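To make the "trivial program" point concrete, here is a minimal sketch of what such a program might look like. It assumes the task is line-based and that the modified document is exactly the original with some lines removed (no edits or reordering); the function name `find_omitted_lines` is hypothetical and not taken from the paper.

```python
def find_omitted_lines(original: str, modified: str) -> list[str]:
    """Return the lines of `original` that do not appear in `modified`.

    Assumption: `modified` is `original` with zero or more whole lines
    deleted, and nothing else changed.
    """
    remaining = modified.splitlines()
    pos = 0  # index of the next line in `modified` we expect to see
    omitted = []
    for line in original.splitlines():
        if pos < len(remaining) and line == remaining[pos]:
            # This original line survived; move on to the next kept line.
            pos += 1
        else:
            # No match at the current position, so the line was dropped.
            omitted.append(line)
    return omitted


if __name__ == "__main__":
    original = "roses are red\nviolets are blue\nsugar is sweet\nand so are you"
    modified = "roses are red\nsugar is sweet"
    print(find_omitted_lines(original, modified))
```

A single pass with one pointer is enough under that assumption. When the same line occurs more than once, which copy was removed can be ambiguous, but the text of the reported lines is still correct.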

The most interesting thing, though, is what other aspects of intelligence we may not have identified explicitly, and whether LLMs and current AI are very bad at them. This paper suggests that there are likely many of those, and in general it seems like a pretty fun time for people working on building benchmarks.
