Comment by gerdesj - Hacker Neue

gerdesj Feb 1, 2025 parent

Are you implying it isn't?

(evidence please, everyone)

BinRoo Feb 1, 2025

Simple example: o3-mini-high gets this [1] right, whereas Gemini 2.0 Flash 01-21 gets it wrong.

[1] https://chatgpt.com/share/679d9579-5bb8-8008-ac4a-38cef65b45...

xnx Feb 1, 2025

Great example. Thank you. Can confirm that none of the Gemini models warned about the exception without prompting.

maeil Feb 1, 2025

This agrees with my limited testing so far, but in a different way: o3 is better at coding and objective tasks, while the most recent Flash 2.0-thinking is stronger at subjective tasks. Similarly, o3 seems better at shorter output sizes but drops off on longer ones, where it tends to be lazy.
