No surprise that these products are all dreamt up by powerful tech CEOs who are used to all of their human interactions being with servile people-pleasers. I bet each and every one of them is subtly or overtly shaped by feedback from executives about how it should respond in conversation.
You may be on to something there: the guys and gals who build this stuff may very well be imbuing these products with the kind of attitude that they like to see in their subordinates. They're cosplaying the 'eager to please' element to the point of massive irritation and have left out the one feature that could redeem such behavior, which is competence.
Or that the individual developers see in themselves. Every team I've worked with in my career had one or two of these guys: when the Director or VP came into town, they'd instantly launch into brown-nose mode. One guy was overt about it and would say things like "So-and-so is visiting the office tomorrow--time to do some petting!" Both the executive and the subordinate have normalized the "royal treatment" on the giving and receiving end.
LLMs have an entire wrapper around them tuned to be as engaging as possible. Most people's experience of LLMs is shaped by a design heavily influenced by social media and the engagement economy.
Putin didn't come to think Russia could take Ukraine in 3 days, with literal celebration by the populace, because he only works with honest folks, for example.
Rich people get disconnected from reality because people who insist on speaking truth and reality around them tend to stop getting invited to the influence peddling sessions.
1: https://en.wikipedia.org/wiki/The_Emperor%27s_New_Clothes
Which makes it more interesting. Apparently reddit was a particularly hefty source for most LLMs; your average reddit conversation is absolutely nothing like this.
Separate observation: that kind of semi-slimy obsequious behaviour annoys me. Significantly so. It raises my hackles; I get the feeling I'm being sold something on the sly. Even if I know the content in between all the sycophancy is objectively decent, my instant emotional response is negative and I have to use my rational self to dismiss that part of the ego.
But I notice plenty of people around me that respond positively to it. Some will even flat out ignore any advice if it is not couched in multiple layers of obsequious deference.
That raises a question for me: is it innate? Are all people placed on a presumably bell-curve-shaped chart of 'emotional response to such things', with the bell curve quite smeared out?
Because if so, that would explain why some folks have turned into absolute zealots for the AI thing, on both sides of it. If you respond negatively to it, any serious attempt to play with it should leave you feeling like it sucks to high heaven. And if you respond positively to it - the reverse.
Idle musings.
But I'm also showing off my ignorance with how these machines turn text into tokens in practice.
Thanks!
As flawed as they are currently, I remain astounded that people think they will never improve and that people don't want a plastic pal who's fun to be with(tm).
I find them frustrating personally, but then I ask them deep technical questions on obscure subjects and I get science fiction in return.
And once this garbage is in your context, it's polluting everything that comes after. If they don't know, I need them to shut up. But they don't know when they don't know. They don't know shit.
"To add two numbers I must first simulate the universe." types that created a bespoke DSL for every problem. Software engineering is a field full of educated idiots.
Programmers really need to stop patting themselves on the back. Same old biology with the same old faults. Programmers are subjected to the same old physics as everyone else.
The real human position would be to be an ass-kisser who hangs on every word you say, asks flattering questions to keep you talking, and takes copious notes to figure out how they can please you. LLMs aren't taking notes correctly yet, and they don't use their notes to figure out what they should be asking next. They're just constantly talking.
Only in the same way that all humans behave the same.
You can prompt an LLM to talk to you however you want; it doesn't have to be nice to you.
Again, this isn't really true with some recent models. Some have the opposite problem.
Sounds like you don't know how RLHF works. Everything you describe is post-training. Base models can't even chat, they have to be trained to even do basic conversational turn taking.
Having just read a load of Quora answers like this, which did not cover the thing I was looking for: that is how humans on the internet behave, and how people have to write books, blog posts, articles, and documentation. Without the "dance" to choose a path through a topic on the fly, the author has to take on the burden of providing all relevant context, choosing a path, explaining why, and guessing at any objections and questions and including those as well.
It's why "this could have been an email" is a bad shout. The summary could have been an email, but the bit which decided on that being the summary would be pages of guessing all the things which what might have been in the call and which ones to include or exclude.
The internet really used to be efficient, and I could always find exactly what I wanted with an imprecise Google search ~15 years ago.
This is not how humans converse in human social settings. The medium is the message, as they say.
Even here, I'm responding to you on a thread that I haven't been in on previously.
There's also a lot more material out there in the format of Stack Exchange questions and answers, Quora posts, blog posts and such than there is for consistent back and forth interplay between two people.
IRC chat logs might have been better...ish.
The cadence of discussion is unique to the medium in which the discussion happens. What's more, the prompt may require further investigation and elaboration prior to a more complete response, while other times it may be something that requires storytelling and making it up as it goes.
I had the same issue with Kagi, where I'd follow the citation and it would say the opposite of the summary.
A human can make sense of search results with a little time and effort, but current AI models don't seem to be able to.
"I ask a short and vague question and you response with a scrollbar-full of information based on some invalid assumptions" is not, by any reasonable definition, a "chat".
This also makes me curious: to what degree does this phenomenon manifest when interacting with LLMs in languages other than English? Which languages have less tendency toward sycophantic confidence? More? Or does it exist at a layer abstracted from the particular language?
Doesn't matter if you tell it "that's not correct and neither is ____ so don't try that instead," it likes those two answers and it's going to keep using them.
The only reliable way to recover is to edit your previous question to include the clarification, and let it regenerate the answer.
And less than two weeks in they removed it and replaced it with some sort of "plain and clear" personality which is human-like. And my frustrations ramped up again.
That brief experiment taught me two things: 1. I need to ensure that any robots/LLMs/mech-turks in my life act at least as cold and rational as Data from Star Trek. 2. I should be running my own LLM locally to not be at the whims of $MEGACORP.
I approve of this, but in your place I'd wait for hardware to become cheaper when the bubble blows over. I have an i9-10900, bought an M.2 SSD and 64GB of RAM for it in July, and get useful results with Qwen3-30B-A3B (some 4-bit quant from unsloth running on llama.cpp).
It's much slower than an online service (~5-10 t/s), and lower quality, but it still offers me value for my use cases (many small prototypes and tests).
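For the curious, here's a minimal sketch of what that setup looks like through the llama-cpp-python bindings instead of the raw llama.cpp CLI; the GGUF filename and the tuning knobs are just placeholders for whichever unsloth quant you actually download and whatever hardware you have:

    from llama_cpp import Llama

    # Placeholder path: point this at the 4-bit unsloth GGUF you downloaded.
    llm = Llama(
        model_path="./Qwen3-30B-A3B-Q4_K_M.gguf",
        n_ctx=8192,       # context window; raise it if you have RAM to spare
        n_threads=10,     # roughly match your physical core count
        n_gpu_layers=0,   # CPU-only on a box like an i9-10900 with no big GPU
    )

    resp = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Sketch a test plan for a small prototype."}],
        max_tokens=256,
    )
    print(resp["choices"][0]["message"]["content"])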
In the meantime, check out LLM service prices on https://artificialanalysis.ai/ - open-source ones are cheap! Lower on the homepage there's a Cost Efficiency section with a Cost vs Intelligence chart.
https://dev.to/composiodev/qwen-3-vs-deep-seek-r1-evaluation...
If it runs, it looks like I can get a bit more quality. Thanks for the suggestion.
I recently introduced a non-technical person to Claude Code, and this non-human behavior was a big sticking point. They tried to talk to Claude as they would to a human, presenting it one piece of information at a time. With humans this is generally beneficial, and they will either nod for you to continue or ask clarifying questions. With Claude this does not work well; you have to infodump as much as possible in each message.
So even from the perspective of "how do we make this automaton into the best tool", a more human-like conversation flow might be beneficial. And that doesn't seem beyond the technological capabilities at all; it's just not what we encourage in today's RLHF.
A lazy example:
"This goal of this project is to do x. Let's prepare a .md file where we spec out the task. Ask me a bunch of questions, one at a time, to help define the task"
Or you could just ask it to be more conversational, instead of just asking questions. It will do that.
I'm prompting Gemini, and I write:
I have the following code, can you help me analyze it? <press return>
<expect to paste the code into my chat window>
but Gemini is already generating output, usually saying "I'm waiting for you to enter the code"
Maybe we want a smaller model tuned for back and forth to help clarify the "planning doc" email. Makes sense that having it all in a single chat-like interface would create confusion and misbehavior.
I'd be surprised if you can't already make current models behave like that with an appropriate system prompt.
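Something along these lines, for example - a rough sketch using the OpenAI Python SDK, where the exact system-prompt wording and the model name are just placeholders:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # A system prompt nudging the model toward back-and-forth clarification
    # instead of an immediate infodump.
    system = (
        "You are a collaborator, not an answer machine. "
        "Ask at most one clarifying question per turn, wait for the reply, "
        "and only give a full answer once you have enough context. "
        "Keep each turn under three sentences."
    )

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat-tuned model should do
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": "I want to refactor my billing code."},
        ],
    )
    print(resp.choices[0].message.content)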
A strictly machinelike tool doesn't begin answers by saying "Great question!"
The same goes for "rules" - you train an LLM with trillions of tokens and try to regulate its behavior with thousands. If you think of a person in high school, grading and feedback is a much higher percentage of the training.
What bothers me the most is the seemingly unshakable tendency of many people to anthropomorphise this class of software tool as though it is in any way capable of being human.
What is it going to take? Actual, significant loss of life in a medical (or worse, military) context?
My calculator is a great tool but it is not a mathematician. Not by a long shot.
I think it's important to be skeptical and push back against a lot of the ridiculous mass-adoption of LLMs, but not if you can't actually make a well-reasoned point. I don't think you realize the damage you do when the people gunning for mass proliferation of LLMs in places they don't belong can only find examples of incoherent critique.
An untrained and unspecialised human can be trained quickly and reliably for the cost of meals and lodging and will very likely actually try to do the right thing because of personal accountability.
Delegating responsibility to badly-designed or outright unfit-for-purpose systems because of incoherent confidence is plainly a bad plan.
As for the other nuances of your post, I will assume the best intention.
On the project I did work on, reviewers were not allowed to e.g. answer that they didn't know - they had to provide an answer to every prompt provided. And so when auditing responses, a lot of difficult questions had "confidently wrong" answers because the reviewer tried and failed, or all kinds of evasive workarounds because they knew they couldn't answer.
Presumably these providers will eventually understand (hopefully they already have - this was a year ago) that they also need to train the models to recognize when the correct answer is "I don't know", or "I'm not sure. I think maybe X, but ..."
And I literally saw the effect of this first hand, in seeing how the project I worked on was actively part of training this behaviour into a major model.
As for your assertion they don't learn the more complex dynamics, that was trite and not true already several years ago.
In a way it's benchmaxxing because people like subservient beings that help them and praise them. People want a friend, but they don't want any of that annoying friction that comes with having to deal with another person.
The only things I care about are whether the answer helps me out and how much I paid for it, whether it took the model a million tokens or one to get to it.
Maximizing the utility of your product for users is usually the winning strategy.
ChatGPT deep research does this, but it’s weird and forced because it asks one series of questions and then it’s off to the races, spending half an hour or more building a report. It’s frustrating if you don’t know what to expect, and my wife got really mad the first time she wasted a deep research request asking it “can you answer multiple series of questions?” or some other functionality-clarifying question.
I’ve found Cursor’s plan mode extremely useful, similar to having a conversation with a junior or offshore team member who is eager to get to work but not TOO eager. These tools are extremely useful; we just need to get the guard rails and user experience correct.
Another thing the vendors are selecting for is safety / PR risk. If an LLM answers a hobby chemistry question in a matter-of-fact way, that's a disastrous PR headline in the making. If they open with several paragraphs of disclaimers or just refuse to answer, that's a win.
b) If it is, in fact, just one setting away, then I would say it's operating fairly similarly?
For example, there have been many times when they take it too literally instead of looking at the totality of the context and what was written. I'm not an LLM, so I don't have a perfect grasp of every vocab term in every domain, and it feels especially pandering when they repeat back the wrong word but put it in quotes or bold instead of simply asking if I meant something else.
They did at the beginning. It used to be that if you wanted a full answer with an intro, bullet points, lists of pros/cons, etc., you had to explicitly ask for it in the prompt. The answers were also a lot more influenced by the tone of the prompt instead of being forced into answering with a specific format like it does right now.
Maybe it feels a bit sad that you have to follow what the LLM wants, but that's just how any tool works really.
In fact, sometimes I screenshot them and use Mac's new built-in OCR to copy them, because Manus gives me three options but they disappear if I click one, and sometimes I really like 2 or even all 3.
Setting aside the philosophical questions around "think" and "reason"... it can't.
In my mind, as I write this, I think through various possibilities and ideas that never reach the keyboard, yet stay within my awareness.
For an LLM, that awareness and thinking through can only be done via its context window. It has to produce text that maintains what it thought about in order for that past to be something that it has moving forward.
There are aspects of a prompt that can (in some interfaces) hide this internal thought process. For example, ChatGPT has the "internal thinking" view which can be shown - https://chatgpt.com/share/69278cef-8fc0-8011-8498-18ec077ede... - if you expand the first "thought for 32 seconds" bit it starts out with:
I'm thinking the physics of gravity assists should be stable enough for me to skip browsing since it's not time-sensitive. However, the instructions say I must browse when in doubt. I’m not sure if I’m in doubt here, but since I can still provide an answer without needing updates, I’ll skip it.
(aside: that still makes me chuckle - in a question about gravity assists around Jupiter, it notes that it's not time-sensitive... and the passage "I’m not sure if I’m in doubt here" is amusing)
However, this is in the ChatGPT interface. If I'm using an interface that doesn't allow internal self-prompts / thoughts to be collapsed, then such an interface would often be displaying code as part of its working through the problem.
You'll also note a bit of the system prompt leaking in there - "the instructions say I must browse when in doubt". For an interface where code is the expected product, then there could be system prompts that also get in there that try to always produce code.
LLMs neither think nor reason at all.
Also, the conversational behavior we see is just the model mimicking example conversations we had it learn from: when we say “System: you are a helpful assistant. User: let’s talk. Assistant:” it will complete the text in a way that mimics a conversation.
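If it helps make that concrete, here is a toy sketch using the plain GPT-2 base model via transformers (purely for illustration): the "chat" is literally just one string the model keeps extending, and an untuned base model will extend it badly, which is exactly why the training described below matters.

    from transformers import pipeline

    # A small base model that has had no chat post-training at all.
    generator = pipeline("text-generation", model="gpt2")

    # The "conversation" is just a string the model continues token by token.
    prompt = (
        "System: you are a helpful assistant.\n"
        "User: let's talk.\n"
        "Assistant:"
    )

    print(generator(prompt, max_new_tokens=40)[0]["generated_text"])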
Yeah, we improved over that using reinforcement learning to steer the text generation into paths that lead to problem solving and more “agentic” traces (“I need to open this file the user talked about to read it and then I should run bash grep over it to find the function the user cited”), but that’s just a clever way we found to let the model itself discover which text generation paths we like the most (or are more useful to us).
So to comment on your discomfort, we (humans) trained the model to spill out answers (there are thousands of human beings right now writing nicely thought-out and formatted answers to common questions so that we can train the models on that).
If we try to train the models to mimic long dances into shared meaning we will probably decrease their utility. And we wouldn’t be able to do that anyway, because we would need customized text traces for each individual instead of question-answer pairs.
Downvoters: I simplified things a lot here, in the name of understanding, so bear with me.
You could say the same thing about humans.
Humans existed for 10s to 100s of thousands of years without text. or even words for that matter.
Bruh
Just yesterday I observed myself acting on an external stimulus without any internal words (this happens continuously, but it is hard to notice because we usually don't pay attention to how we do things): I sat in the waiting area of a cinema. A woman walked by and dropped her scarf without noticing. Automatically, without thinking, I raised my arm and pointer finger towards her, and when I had her attention pointed behind her. I did not have time to think even a single word while that happened.
Most of what we do does not involve any words or even just "symbols", not even internally. Instead, it is a neural signal from sensors into the brain, doing some loops, then directly to muscle activation. Without going through the add-on complexity of language, or even "symbols".
Our word generator is not the core of our being, it is an add-on. When we generate words it's also very far from being a direct representation of internal state. Instead, we have to meander and iterate to come up with appropriate words for an internal state we are not even quite aware of. That's why artists came up with all kinds of experiments to better represent our internal state, because people always knew the words we produce don't represent it very well.
That is also how people always get into arguments about definitions. Because the words are secondary, and the further you get from the center of established meaning for some word, the more the differences show between various people. (The best option is to stop insisting that words are the center of the universe, even just the human universe, and/or to choose words that have the subject of discussion more firmly in the center of their established use.)
We are text generators in some areas, I don't doubt that. Just a few months ago I listened to some guy speaking to a small rally. I am certain that not a single sentence he said was of his own making, he was just using things he had read and parroted them (as a former East German, I know enough Marx/Engels/Lenin to recognize it). I don't want to single that person out, we all have those moments, when we speak about things we don't have any experiences with. We read text, and when prompted we regurgitate a version of it. In those moments we are probably closest to LLM output. When prompted, we cannot fall back on generating fresh text from our own actual experience, instead we keep using text we heard or read, with only very superficial understanding, and as soon as an actual expert shows up we become defensive and try to change the context frame.
> The human world model
Bruh this concept is insane
It's easy to skip and skim content you don't care about. It's hard to prod and prod to get it to say something you do care about if the machine is trained to be very concise.
Complaining the AI can't read your mind is exceptionally high praise for the AI, frankly.
You obviously never wasted countless hours trying to talk to other people on online dating apps.
It's not "masquerading as a human". The majority of humans are functional illiterates who only understand the world through the elementary principles of their local culture.
It's the minority of the human species, those who engage in what amounts to little more than arguing semantics, that need the reality check. Unless one is involved in work that directly impacts public safety (defined as harm to biology), the demand to apply one concept or another is arbitrary preference.
Healthcare, infrastructure, and essential biological support services are all most humans care about. Everything else the majority see as academic wank.
LLMs don't do this. Instead, every question is immediately responded to with extreme confidence with a paragraph or more of text. I know you can minimize this by configuring the settings on your account, but to me it just highlights how it's not operating in a way remotely similar to the human-human one I mentioned above. I constantly find myself saying, "No, I meant [concept] in this way, not that way," and then getting annoyed at the robot because it's masquerading as a human.