To be fair, I'd put finding literal string diffs in the category of asking LLMs to do rote arithmetic.
The attention mechanism does far too much complex processing for such a dumb task. This is precisely where you need to dumb down, focus, and be disciplined rather than do high-level next-token prediction.
You'd likely benefit from asking the LLM to list the full document and compare it line by line, a kind of reasoning pass, similar to how LLMs perform better when they break arithmetic or algebra tasks down into smaller steps.
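Of course, a literal string diff is exactly the kind of rote task a deterministic tool does trivially, no attention mechanism required. A minimal sketch of the line-by-line comparison using Python's stdlib difflib:

```python
import difflib

old = """line one
line two
line three""".splitlines()

new = """line one
line 2
line three""".splitlines()

# unified_diff emits only the changed hunks, which is exactly the
# focused, mechanical comparison an LLM tends to fumble when asked
# to eyeball two whole documents at once.
diff = list(difflib.unified_diff(old, new, lineterm=""))
for line in diff:
    print(line)
```

An LLM pipeline could call a tool like this for the literal diff and reserve the model for interpreting what the changes mean.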
Also, my guess would be that the models that perform well are MoE models, where there may be an expert or two that does well on tasks that need focus rather than intuition. So without knowing anything about Gemini Flash, my guess would be that it's an MoE model.