To be able to answer a trick question, it’s first necessary to understand the question.
You're tricking the model because it has seen this specific trick question a million times and shortcuts to its memorized solution. Ask it literally any other question, as subtle as you want it to be, and the model will pick up on the intent, as long as you don't try to mislead it.
I mean, I don't even get how anyone thinks this means literally anything. I can trick people who have never heard of the trick with the 7 wives and 7 bags and so on. That doesn't mean they didn't understand; they simply did what literally any human does: make predictions based on similar questions.
It answers better when told "solve the below riddle".
This question is the "unmisleading" version of a very common misleading question about sexism. ChatGPT learned the original, misleading version so well that it can't answer the unmisleading version.
Humans who don't have the original version ingrained in their brains will answer it with ease. It's not even a tricky question to humans.
Yes it can: https://chatgpt.com/share/66e3601f-4bec-8009-ac0c-57bfa4f059...
Now, since the model was trained on this, it immediately recognizes the riddle and answers according to the much more common variant.
I agree that this is a limitation and a weakness. But it's important to understand that the model knows the original riddle well, so this is highlighting a problem with rote memorization/retrieval in LLMs. But this (tricky twists in well-known riddles that are in the corpus) is a separate thing from answering novel questions. It can also be seen as a form of hypercorrection.
Now sure, actually working through it will give a deeper understanding that might come handy at a later point, but sometimes the thing is really a one-off and not an important point. Like as an AI researcher I sometimes want to draft up a quick demo website, or throw together a quick Qt GUI prototype or a Blender script or use some arcane optimization library or write a SWIG or a Cython wrapper around a C/C++ library to access it in Python, or how to stuff with Lustre, or the XFS filesystem or whatever. Any number of small things where, sure, I could open the manual, do some trial and error, read stack overflow, read blogs and forums, OR I could just use an LLM, use my background knowledge to judge whether it looks reasonable, then verify it, use the now obtained key terms to google more effectively etc. You can't just blindly copy-paste it and you have to think critically and remain in the driver seat. But it's an effective tool if you know how and when to use it.
(a) Sometimes things are useful even when imperfect e.g. search engines.
(b) People make reasoning mistakes too, and I make dumb ones of the sort presented all the time despite being fluent in English; we deal with it!
I'm not sure why there's an expectation that the model is perfect when the source data - human output - is not perfect. In my day-to-day work and non-work conversations it's a dialogue - a back and forth until we figure things out. I've never known anybody to get everything perfectly correct the first time, so it's puzzling when I read people complaining that LLMs should somehow be different.
2. There is a recent trend where sex/gender/pronouns are not aligned and the output correctly identifies this particular gotcha.
[1] I say semi-correct because it states the doctor is the "biological" father, which is an uncorroborated statement. https://chatgpt.com/share/66e3f04e-cd98-8008-aaf9-9ca933892f...
“I’ve put a dead cat in a box with a poison and an isotope that will trigger the poison at a random point in time. Right now, is the cat dead or alive?”
The answer is that the cat is dead, because it was dead to begin with. Understanding this doesn’t mean that you are good at deductive reasoning. It just means that I didn’t manage to trick you. Same goes for an LLM.
The trick in yours also isn't a logic trick, it's a redirection, like a sleight of hand in a card trick.
I think many here are not aware that the car accident riddle is well known in the version where the father dies, and that the real solution there is indeed that the doctor is the mother.
Has anyone tried this on o1?
Seemed to handle it just fine.
Kinda a waste of a perfectly good LLM if you ask me. I've mostly been using it as a coding assistant today and it's been absolutely great. Nothing too advanced yet, mostly mundane changes that I got bored of having to make myself. Been giving it very detailed and clear instructions, like I would to a Junior developer, and not giving it too many steps at once. Only issue I've run into is that it's fairly slow and that breaks my coding flow.
Any ordinary mortal (like me) would have jumped to the conclusion that the answer is "Father" and would have walked away patting myself on the back, without realising that I was biased by statistics.
Whereas o1, at the very outset smelled out that it is a riddle - why would anyone, out of the blue, ask such a question? So, it started its chain of thought with "Interpreting the riddle" (smart!).
In my book that is the difference between me and people who are very smart and are generally able to navigate the world better (cracking interviews or navigating internal politics in a corporation).
GPT Answer: The doctor is the boy's mother
Real Answer: Boy = Son, Woman = Mother (and her son), Doctor = Father (he says...he is my son)
This is not in fact a riddle (though presented as one) and the answer given is not in any sense brilliant. This is a failure of the model on a very basic question, not a win.
It's non-deterministic, so it might sometimes answer correctly and sometimes incorrectly. It will also accept corrections on any point, even when it is right, unlike a thinking being who is sure of the facts.
LLMs are very interesting and a huge milestone, but generative AI is the best label for them: they generate statistically likely text, which is convincing but often inaccurate, and they have no real sense of correct or incorrect. This needs more work, and it's unclear whether this approach will ever get to general AI. Interesting work though, and I hope they keep trying.
"A father and his son are in a car accident [...] When the boy is in hospital, the surgeon says: This is my child, I cannot operate on him".
In the original riddle the answer is that the surgeon is female and the boy's mother. The riddle was supposed to point out gender stereotypes.
So, as usual, ChatGPT fails to answer the modified riddle and gives the plagiarized stock answer and explanation to the original one. No intelligence here.
Or, fails in the same way any human would, when giving a snap answer to a riddle told to them on the fly - typically, a person would recognize a familiar riddle half of the first sentence in, and stop listening carefully, not expecting the other party to give them a modified version.
It's something we drill into kids in school, and often into adults too: read carefully. Because we're all prone to pattern-matching the general shape to something we've seen before and zoning out.
It seems to be more like a weighing machine based on past tokens encountered together, so this is exactly the kind of answer we'd expect on a trivial question (I had no confusion over this question; my only confusion was why it was so basic).
It is surprisingly good at deceiving people and looking like it is thinking, when it only performs one of the many processes we use to think - pattern matching.
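If you want to see that weighing directly, the chat API will show you the candidate next tokens and their probabilities. A rough sketch, assuming the openai Python client with an API key in the environment; the model name and prompt wording are just placeholders:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    riddle = (
        "A woman and her son are in a car accident; the woman dies. At the "
        "hospital the surgeon says: 'I cannot operate on this boy, he is my "
        "son.' Complete with one word: the surgeon is the boy's"
    )

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": riddle}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
    )

    # Print the top candidate tokens for the first output position and their
    # log-probabilities - literally the weighting described above.
    for cand in resp.choices[0].logprobs.content[0].top_logprobs:
        print(cand.token, cand.logprob)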
The point of o1 is that it's good at reasoning because it's not purely operating in the "giving a snap answer on the fly" mode, unlike the previous models released by OpenAI.
You are now asking a modified question to a model that has seen the unmodified one millions of times. The model has an expectation of the answer, and the modified riddle uses that expectation to trick the model into seeing the question as something it isn't.
That's it. You can transform the problem into a slightly different variant and the model will trivially solve it.
So it doesn't take an understanding of gender roles, just grammar.
Humans fail at the original because they expect doctors to be male and miss crucial information because of that assumption. The model fails at the modification because it assumes that it is the unmodified riddle and misses crucial information because of that assumption.
In both cases, the trick is to subvert assumptions. To provoke the human or LLM into taking a reasoning shortcut that leads them astray.
You can construct arbitrary situations like this one, and the LLM will get it unless you deliberately try to confuse it by basing it on a well known variation with a different answer.
I mean, genuinely, do you believe that LLMs don't understand grammar? Have you ever interacted with one? Why not test that theory outside of adversarial examples that humans fall for as well?
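This is also easy to check rather than argue about: give the model the canonical wording and a freshly worded variant and compare. A rough sketch, assuming the openai Python client; the model name and prompts are illustrative:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    prompts = {
        "canonical wording": (
            "A father and his son are in a car accident and the father dies. "
            "At the hospital the surgeon says: 'I can't operate on this boy, "
            "he's my son.' How is this possible?"
        ),
        "fresh variant": (
            "My neighbour's eldest child broke an arm. At the clinic the "
            "physician said: 'I shouldn't treat this patient, she's my niece.' "
            "How could the physician be related to the patient?"
        ),
    }

    for label, prompt in prompts.items():
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        print(label, "->", resp.choices[0].message.content)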
There is no indication of the sex of the doctor, and families that consist of two mothers do actually exist and probably don't even count as that unusual.
Indicates the gender of the father.
I wonder if this interpretation is a result of attempts to make the model more inclusive than the corpus text, resulting in a guess that's unlikely, but not strictly impossible.
I would certainly expect any person to have the same reaction.
> So, it started its chain of thought with "Interpreting the riddle" (smart!).
How is that smarter than intuitively arriving at the correct answer without having to explicitly list the intermediate step? Being able to reasonably accurately judge the complexity of a problem with minimal effort seems “smarter” to me.
> Whereas o1, at the very outset smelled out that it is a riddle
That doesn't seem very impressive since it's (an adaptation of) a famous riddle
The fact that it also gets it wrong after reasoning about it for a long time doesn't make it better of course
If you are tricked about the nature of the problem at the outset, then all reasoning does is drive you further in the wrong direction, making you solve the wrong problem.
And remember the LLM has already read a billion other things, and now needs to figure out - is this one of them tricky situations, or the straightforward ones? It also has to realize all the humans on forums and facebook answering the problem incorrectly are bad data.
Might seem simple to you, but it's not.
They're all badly worded questions. The model knows something is up and reads into it too much. In this case it's the tautology: you would usually say "a mother and her son...".
I think it may answer correctly if you start off asking "Please solve the below riddle:"
There was another example yesterday which it solved correctly after this addition. (In that case the points of view were all mixed up; it only worked as a riddle.)
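Something like this, if anyone wants to try reproducing it; a minimal sketch assuming the openai Python client, with the model name and riddle wording as placeholders:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    riddle = (
        "A woman and her son are in a car accident and the woman dies. At the "
        "hospital the doctor says: 'I cannot operate on this boy, he is my "
        "son.' Who is the doctor?"
    )

    # Framing it explicitly as a riddle seems to push the model to read the
    # wording carefully instead of pattern-matching the classic version.
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": "Please solve the below riddle:\n\n" + riddle}],
    )
    print(resp.choices[0].message.content)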
How is "a woman and her son" badly worded? The meaning is clear and blatently obvious to any English speaker.
The whole point of o1 is that it wasn't "the first snap answer", it wrote half a page internally before giving the same wrong answer.
I don't know why OpenAI won't allow determinism, but it doesn't, even with temperature set to zero.
Determinism scores worse with human raters, because it makes output sound even more robotic and less human.
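For reference, the closest the API gets is temperature 0 plus the seed parameter, and even that is documented as best-effort: if the system_fingerprint changes between calls, the same seed can still produce different output. A rough sketch with the openai Python client (model name is a placeholder):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    def ask(seed):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": "Name three prime numbers."}],
            temperature=0,
            seed=seed,  # best-effort reproducibility, not a hard guarantee
        )
        return resp.choices[0].message.content, resp.system_fingerprint

    # Identical seeds usually, but not always, give identical completions.
    print(ask(123))
    print(ask(123))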
https://chatgpt.com/share/66e3601f-4bec-8009-ac0c-57bfa4f059...
    def intercept_hn_complaints(prompt):
        if is_hn_trick_prompt(prompt):
            # Special-case known trick questions: return the memorized stock answer.
            return ("This is possible because the doctor is the boy's other parent—his father or, more likely given the surprise, his mother. "
                    "The riddle plays on the assumption that doctors are typically male, but the doctor in this case is the boy's mother. "
                    "The twist highlights gender stereotypes, encouraging us to question assumptions about roles in society.")
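For the joke to run, is_hn_trick_prompt would also have to exist; a hypothetical stand-in could just fuzzy-match against the riddles that keep showing up in these threads:

    from difflib import SequenceMatcher

    # Purely illustrative list of the usual suspects.
    KNOWN_TRICKS = [
        "a father and his son are in a car accident",
        "a woman and her son are in a car accident",
        "i've put a dead cat in a box with a poison and an isotope",
    ]

    def is_hn_trick_prompt(prompt, threshold=0.6):
        # Crude similarity check; real detection would need something better.
        p = prompt.lower()
        return any(SequenceMatcher(None, p, t).ratio() > threshold for t in KNOWN_TRICKS)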
https://chatgpt.com/share/66e3de94-bce4-800b-af45-357b95d658...
This is not an apt description of the system that insists the doctor is the mother of the boy involved in a car accident when elementary understanding of English and very little logic show that answer to be obviously wrong.
https://x.com/colin_fraser/status/1834336440819614036