Quite clever and useful benchmark. This implies that without tool use, LLMs have a fundamental limitation when it comes to tasks like code review.

iknownothow
I'd say that's where we're headed: a big model that's trained from the start to use tools, and to know when to use which tool and how. Like us :)

I wouldn't be surprised if someone's building a dataset of tool-use examples.

The newer gen reasoning models are especially good at knowing when to do web search. I imagine they'll slowly get better at other tools.

At current levels of performance, giving LLMs the ability to fetch well-curated information by themselves would increase their scores by a lot.
