Comment by rapfaria - Hacker Neue

rapfaria Sep 29, 2025 parent

His "pelican riding a bicycle" tests are now a classic and AI shops are benchmaxxing for it

simonw Sep 29, 2025

They need to benchmaxxx a whole lot harder, the illustrations still all universally suck!

lxgr Sep 29, 2025

I fully expect a model to output an SVG made up of 1000x1000 rectangles (i.e. pixels) representing a raster image of a beautifully hand-drawn pelican riding a bicycle any day now :)
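For illustration, the "one rectangle per pixel" trick could look roughly like the sketch below. It assumes the image is already available as a 2D grid of (r, g, b) tuples; the function name `pixels_to_svg` is made up for this example and does not describe what any model actually does.

```python
def pixels_to_svg(pixels):
    """Render a 2D grid of (r, g, b) tuples as an SVG with one 1x1 <rect> per pixel.

    `pixels` is a list of rows; each row is a list of (r, g, b) ints in 0..255.
    This is a hypothetical helper for illustration only.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}" viewBox="0 0 {width} {height}">'
    ]
    for y, row in enumerate(pixels):
        for x, (r, g, b) in enumerate(row):
            # Each pixel becomes its own rectangle, filled with a hex colour.
            parts.append(
                f'<rect x="{x}" y="{y}" width="1" height="1" '
                f'fill="#{r:02x}{g:02x}{b:02x}"/>'
            )
    parts.append("</svg>")
    return "\n".join(parts)
```

A 1000x1000 image would produce a million `<rect>` elements this way, so the file is technically a vector format but carries no real vector structure.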

simonw Sep 29, 2025

I got an amazing result from ChatGPT a while back - an SVG with a perfect illustration of a pelican riding a bicycle.

It was suspiciously good in fact... so I downloaded the SVG file and found out it had generated a raster image with its image tool and then embedded it as base64 binary image data inside an SVG wrapper!
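A check for this kind of wrapper might look like the sketch below: it parses the SVG and reports any `<image>` element whose link is a base64 `data:` URI for a raster format. The function name `embedded_raster_images` and the list of formats are assumptions for this example, not a description of how the file above was actually inspected.

```python
import re
import xml.etree.ElementTree as ET

# Older SVGs use xlink:href; newer ones use a plain href attribute.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
# Assumed set of raster formats worth flagging.
DATA_URI = re.compile(r"^data:image/(png|jpe?g|gif|webp);base64,", re.IGNORECASE)


def embedded_raster_images(svg_text):
    """Return the raster formats of base64 images embedded in an SVG string."""
    root = ET.fromstring(svg_text)
    found = []
    for el in root.iter():
        # Strip the "{namespace}" prefix ElementTree puts on tag names.
        if el.tag.rsplit("}", 1)[-1] != "image":
            continue
        href = el.get("href") or el.get(XLINK_HREF) or ""
        match = DATA_URI.match(href.strip())
        if match:
            found.append(match.group(1).lower())
    return found
```

An empty list means no embedded raster was found; it does not prove the drawing was produced as genuine vector paths.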

dhhugley Sep 30, 2025

You’ll just have to move the goalposts then; perhaps it can be a multidimensional pelican saving the multiverse, or an invisible pelican that only you can see and critique.

lxgr Sep 30, 2025

How would that help, given that ChatGPT has apparently already figured out how to consistently and systematically game the benchmark by working in pixel space and only using SVG as a wrapper for a raster image?

FWIW, I could totally see a not hugely more advanced model using its native image generation capabilities and then running a vector extraction tool on the result, maybe iteratively. (And maybe I would not consider that cheating anymore, since at some point that probably resembles what humans do?)

sixeyes Sep 30, 2025

I've gotten such pixelated-rectangle SVGs a few times.

Also with Cursor: when asked "write me a script that outputs X as an svg", it has given me rectangles a few times.

astrange Sep 29, 2025

If they were testing that it'd work more often.

Other things you can ask that they're still clearly not optimizing for are ASCII art and directions between different locations. Complete fabrications 100% of the time.

Sharlin Sep 29, 2025

Well, I definitely hope they aren't trying to teach LLMs directions between locations, given how idiotic a use of compute and parameter space that would be. We already have excellent AIs for route planning. What they ought to optimize for is, of course, finally teaching them to say they don't know, or just automatically opting to call a route-planning API if the user asks for directions.
