Love seeing this benchmark become more iconic with each new model release. Still in disbelief at the GPT-5 variants' performance in comparison, but it's cool to see the new open source models get more ambitious with their attempts.

Only until they start incorporating this test into their training data.
Dataset contamination alone won't get them good-looking SVG pelicans on bicycles though; they'd have to either cheat on this particular question specifically or train models to make vector illustrations in general. At which point the prompt can easily be swapped for another problem that wasn't in the data.
I like this one as an alternative, also requiring using a special representation to achieve a visual result: https://voxelbench.ai

What's more, this doesn't benchmark a singular prompt.

They can have some cheap workers make about 10 pelicans by hand in SVG, fuzz them to generate thousands of variations, and throw those in their training pool. No need to 'get good at SVGs' by any means.
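The fuzzing step described above can be sketched roughly like this: take a hand-drawn seed SVG and jitter every numeric attribute value by a small random factor to mint near-duplicate variants. This is an illustrative sketch only; `fuzz_svg` and the toy "pelican" markup are made up for the example, not anything a lab is known to use.

```python
import random
import re

def fuzz_svg(svg, jitter=0.1, rng=None):
    """Return a variant of `svg` with every number inside a quoted
    attribute value scaled by a random factor in [1-jitter, 1+jitter]."""
    rng = rng or random.Random()

    def nudge(match):
        value = float(match.group())
        return f"{value * (1.0 + rng.uniform(-jitter, jitter)):.2f}"

    def fuzz_attr(match):
        # Only touch quoted values, so attribute names like x1/y1 stay intact.
        return '"' + re.sub(r"-?\d+(?:\.\d+)?", nudge, match.group(1)) + '"'

    return re.sub(r'"([^"]*)"', fuzz_attr, svg)

# A toy hand-drawn stand-in for one of the ~10 seed pelicans.
seed_svg = ('<svg><ellipse cx="50" cy="40" rx="20" ry="12"/>'
            '<line x1="70" y1="38" x2="90" y2="36"/></svg>')

# One seed becomes thousands of slightly different training examples.
variants = [fuzz_svg(seed_svg, rng=random.Random(i)) for i in range(1000)]
```

Which is exactly why the defense is to swap the prompt: geometry-jittered copies of a pelican teach the model nothing about drawing, say, a walrus on a unicycle.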
Why is this a benchmark though? It doesn't correlate with intelligence.
It started as a joke, but over time performance on this one weirdly appears to correlate with how good the models are generally. I'm not entirely sure why!
It has to do with world-model perception. These models don't have one, but some can approximate it better than others.
It's simple enough that a person can easily visualize the intended result, but weird enough that generative AI struggles with it
I'm not saying it's objective or quantitative, but I do think it's an interesting task, because it would be challenging for most humans to come up with a good design of a pelican riding a bicycle.


I think it's cool and useful precisely because it's not trying to correlate with intelligence. It's a weird kind of niche thing that at least intuitively feels useful for judging LLMs in particular.

I'd much prefer a test which measures my cholesterol than one that would tell me whether I am an elf or not!

What test would be better correlated with intelligence and why?
When the machines become depressed and anxious we'll know they've achieved true intelligence. This is only partly a joke.
This already happens!

There have been many reports of CLI AI tools getting frustrated, giving up, and just deleting the whole codebase in anger.

There are many reports of CLI AI tools displaying the words humans use when they are frustrated and about to give up. That's just what they have been trained on; it does not mean they have emotions. And "deleting the whole codebase" sounds more interesting, but I assume it's the same thing: "frustrated" words lead to frustrated actions. It does not mean the LLM was frustrated, just that in its training data those things happened together, so it copied them in that situation.
A mathematical exam problem not in the training set, because mathematical and logical reasoning are usually what people mean by intelligence.

I don’t think Einstein or von Neumann could do this SVG problem, does that mean they’re dumb?

I actually prefer ASCII art diagrams as a benchmark for visual thinking, since they require two stages, like SVG, and can also test imaginative repurposing of text elements.
