pu_pe
Quite clever and useful benchmark. This implies that without tool use, LLMs have a fundamental limitation when it comes to tasks like code review.
I'd say that's where we're headed: a big model that's trained from the start to use tools, and to know which tool to reach for, when, and how. Like us :)
I wouldn't be surprised if someone's building a dataset for tool use examples.
The newer gen reasoning models are especially good at knowing when to do web search. I imagine they'll slowly get better at other tools.
At current levels of performance, giving LLMs the ability to fetch well-curated information by themselves would increase their scores by a lot.