enragedcacti
To what end? You have to segment and order the document (i.e., solve the problem) just to craft the prompt, so having the LLM spit the solution back at you is useless. The experiment uses these tasks because test cases can be algorithmically generated and scored, but it's not very interesting that one can structure the input to solve this specific, useless task with LLMs. What is interesting is that this limitation could carry over into tasks where traditional algorithms fail. LLMs improving at this would be legitimately useful, which is why a benchmark makes sense, but cheating the benchmark by augmenting the input doesn't.