Markov models are useful for understanding the limitations of any specific next-token predictor with a fixed context window, but not for the limitations of such predictors in general, since there is always a Markov model large enough to handle any specific input.
That's what n-grams are; even traditional token/word/character-based Markov chains don't rely on just the most recent word. Typical Markov chains in NLP are 3- to 7-grams.
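For concreteness, here's a minimal sketch of such an n-gram Markov chain (order 2, i.e. a trigram model); the toy corpus and starting state are just placeholders:

```python
import random
from collections import defaultdict

# Minimal trigram Markov chain over words: the "state" is the last two tokens,
# and the next token is sampled from the continuations seen in training.
def train_trigram(tokens):
    table = defaultdict(list)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        table[(a, b)].append(c)
    return table

def generate(table, state, n_words=20):
    out = list(state)
    for _ in range(n_words):
        choices = table.get(tuple(out[-2:]))
        if not choices:
            break                      # unseen state: nothing to sample from
        out.append(random.choice(choices))
    return " ".join(out)

corpus = "the cat sat on the mat and the cat ran off the mat".split()
model = train_trigram(corpus)
print(generate(model, ("the", "cat")))
```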
> Can you give an example of a non-Markov system?
Encoder-decoder LLMs violate the Markov Property and would not count as Markov Chains.
So, people can correctly call LLMs Markovian in practice, and also non-Markovian from a theoretical standpoint.
I think of it as conceptually a little like the distinction between a formal Turing machine, which requires an infinite tape, and a practical computer with a finite amount of memory. Your computer acts as a Turing machine for the real computations you use it for, but there exist computations that would require more memory than you have. From a theoretical standpoint, your computer is merely a finite state automaton.
Essentially yes. Given complex enough state, more or less any process is Markovian. Some require infinite state, which would maybe be stretching the definition a bit. In practice, the Markov formulation may not be a very good analysis perspective if the required state would be very large or complex.
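A rough sketch of the state-augmentation trick (the process and names here are made up for illustration): a process whose next value depends on the last three observations looks non-Markovian over single observations, but is Markovian once the "state" is defined as the whole three-observation window.

```python
import random
from collections import deque

def step(window):
    # Next value depends on the sum of the last three observations --
    # "long" memory from the point of view of a single observation.
    return (sum(window) + random.choice([0, 1])) % 5

window = deque([0, 1, 2], maxlen=3)   # augmented state: the last 3 observations
for _ in range(10):
    x = step(window)                  # transition depends only on the current state
    window.append(x)                  # new state is again just the last 3 observations
    print(x, tuple(window))
```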
What I don't understand is how the line is drawn between "past" and "present". Why are variables a,b,c viewed as "present" and not "past"? How is past defined? Why can't you view token generation as f(past1, past2, past3, present)?
As I said, it does look "Markovian" if the number of "pasts" is fixed, but again, what exactly is not Markovian then? Infinite inputs? Everything in the real world is finite.
Uh... You're moving the goalposts. I don't know the exact definition of "LLM", but let's suppose that it and the definition of "Markov chain" make us conclude that LLMs are necessarily Markov chains. "It's a Markov chain" is still a useful statement, because there are processes that are neither LLMs nor Markov chains.
I'm not sure what those would be. All dynamical systems (essentially all of classical physics), except maybe quantum mechanics, are Markov chains.
This is the limitation I was getting at, btw. In the example I was describing, if you have an image with solid vertical columns, followed by columns of random static, followed again by solid vertical colors, a Markov chain could eventually learn all patterns that go
solid->32 random bits->different solid color
And eventually it would start predicting the different color correctly based on the solid color before the randomness. It ‘just’ needs a state for every possible random color in between. This is ridiculous in practice, however, since you’d need to learn 2^32 states just for the relationship between those two solid colors alone.
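To make the scale concrete, here is the back-of-the-envelope count for this example, under the same framing (a plain table-based Markov chain that must condition on the entire block of random bits between the two solid colors):

```python
# One table entry per possible noise pattern between the two solid colors.
random_bits = 32
contexts_per_colour_pair = 2 ** random_bits
print(f"{contexts_per_colour_pair:,}")   # 4,294,967,296 states, per colour pair
```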
You can use skipgrams - prefixes with holes in them.
The Sparse Non-negative Matrix Language Model (SNMLM) [1] uses them with great success.
[1] https://aclanthology.org/Q16-1024/
Pure n-gram language models would have a hard time computing escape weights for such contexts, but the mixture of probabilities used in SNMLM does not need to do that.
If I may, I've implemented an online per-byte version of SNMLM [2], which allows the use of skipgrams. They make performance worse, but they can be used. My implementation's predictive performance is within a few percent of an LSTM's on enwik8.
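For readers unfamiliar with the term, here's a rough sketch of what "prefixes with holes" means; this is illustrative only, not the actual SNMLM feature set:

```python
from itertools import combinations

# Skipgram contexts: from the last k tokens, also form contexts where some
# positions are masked out, so a prediction can skip over unpredictable tokens.
def skipgram_contexts(prefix, max_holes=1, hole="_"):
    k = len(prefix)
    contexts = {tuple(prefix)}                      # the plain n-gram prefix
    for n_holes in range(1, max_holes + 1):
        for holes in combinations(range(k), n_holes):
            ctx = tuple(hole if i in holes else t for i, t in enumerate(prefix))
            contexts.add(ctx)
    return contexts

print(skipgram_contexts(["red", "noise", "blue"], max_holes=1))
# includes e.g. ('red', '_', 'blue'), which skips the unpredictable middle token
```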
Chat interfaces are like taking a simple Markov generator, but with a rule where you say 'whenever it reaches a state ending in X, hand over decision making about state transitions to a human instead of a random number generator, until they move it to a state ending in Y'.
Tool calling is similar - 'when it reaches state ending in X, send the state data to a tool; use the result to drive a series of state transitions; then start generating again'.
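A sketch of that control flow; `sample_next`, `run_tool`, `read_user_input` and the sentinel tokens are hypothetical stand-ins, not any real API. The point is only that certain states hand the next transitions to a human or a tool:

```python
# State is the token sequence so far; some tokens transfer control.
def chat_loop(sample_next, run_tool, read_user_input, state, max_steps=100):
    for _ in range(max_steps):
        token = sample_next(state)        # ordinary stochastic transition
        state = state + [token]
        if token == "END_TURN":
            # hand state transitions to a human until they yield control back
            state = state + read_user_input() + ["END_USER_TURN"]
        elif token == "TOOL_CALL":
            # send state data to a tool, splice its result into the state,
            # then resume ordinary generation
            state = state + run_tool(state) + ["TOOL_RESULT_END"]
    return state
```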
The validation of what I'm saying is little more than a few minutes of googling away.
Some interventions can just be warnings, rather than lectures.
The key insight you raise about attention is particularly important: it doesn't violate the Markov property, it just enables a vastly more sophisticated state representation. Classical n-gram Markov chains use simple discrete states, while transformers use high-dimensional continuous representations that can encode exponentially more information about context.
This perspective helps bridge the conceptual gap many people have. When they think "Markov chain," they often picture simple state transitions, but mathematically, LLMs are just Markov chains with astronomically complex states. The attention mechanism is the computational trick that makes these complex states tractable - it's not changing the fundamental probabilistic structure.
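As a toy illustration of what "a vastly more sophisticated state representation" means, here is a single-head causal self-attention step in numpy. The weights and sizes are random placeholders, not a real model; the point is that the output for the last position is a continuous summary of the entire context, rather than a discrete n-gram state:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))                       # token embeddings for the context
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)
mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # causal mask: no peeking ahead
scores[mask] = -np.inf
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
state = weights @ V                               # one continuous vector per position
print(state[-1])                                  # "state" seen at the last position
```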
As far as I understand it, as you have a back-and-forth conversation with an LLM, you have to provide the entire history of the conversation plus your new response each time.
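That's roughly what the client-side loop looks like when the completion endpoint is stateless; `complete` here is a hypothetical stand-in for whatever completion call an API actually exposes:

```python
# Each turn resends the whole conversation; the model only ever sees this list.
def chat_turn(complete, history, user_message):
    history = history + [{"role": "user", "content": user_message}]
    reply = complete(history)
    return history + [{"role": "assistant", "content": reply}]

history = []
history = chat_turn(lambda msgs: "Hi! How can I help?", history, "Hello")
history = chat_turn(lambda msgs: f"(model saw {len(msgs)} messages)", history, "What did I just say?")
print(history[-1]["content"])        # (model saw 3 messages)
```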
...when they are used without RAG and tools. x_n belongs to a set of cardinality 65536^100000 or so. Equating an LLM to a Markov chain doesn't let you make any non-trivial predictions about its behavior.
The essential property of a Markov chain is maintaining the Markov property:
P(X_n+1 = x_n+1 | X_n = x_n, X_n-1 = x_n-1, ..., X_1 = x_1) = P(X_n+1 = x_n+1 | X_n = x_n)
That is, the future, given the present state, is conditionally independent of the past states.
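You can see the property empirically on a toy chain: given a fixed transition matrix, the conditional distribution over the next state given the current one matches the distribution given the current and previous states, up to sampling noise.

```python
import random
from collections import Counter

P = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.5, "B": 0.5}}
random.seed(0)
xs = ["A"]
for _ in range(200_000):
    cur = xs[-1]
    xs.append(random.choices(list(P[cur]), weights=P[cur].values())[0])

pairs = Counter(zip(xs, xs[1:]))
triples = Counter(zip(xs, xs[1:], xs[2:]))
# P(next=A | current=A) vs P(next=A | current=A, previous=B)
p_given_cur = pairs[("A", "A")] / (pairs[("A", "A")] + pairs[("A", "B")])
p_given_cur_prev = triples[("B", "A", "A")] / (triples[("B", "A", "A")] + triples[("B", "A", "B")])
print(round(p_given_cur, 3), round(p_given_cur_prev, 3))  # both ≈ 0.9
```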
It's worth recognizing that decoder-only LLMs (which include most of the major LLMs people use) maintain the Markov property and can properly be understood as Markov chains.
> The attention mechanism is the best we have for this and Markov chains struggle beyond stringing together a few syllables.
Attention has nothing to do with maintaining the Markov property, but it allows for a fantastically more complex representation of state, which is where decoder-only LLMs derive the majority of their power.
tl;dr most of the LLMs people use are effectively Markov Chains.
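A sketch of that view: treat the (bounded) context window itself as the Markov state; the transition appends the sampled token and truncates to the window size, so the next state depends only on the current state. `sample_token` stands in for the model's next-token distribution and is not any real API:

```python
def markov_step(sample_token, state, window=4096):
    token = sample_token(state)               # P(next | current state) only
    return (state + [token])[-window:]        # new state

def generate(sample_token, prompt_tokens, n=20, window=4096):
    state = prompt_tokens[-window:]
    for _ in range(n):
        state = markov_step(sample_token, state, window)
    return state

# Toy "model" that echoes the last token; stands in for a real decoder-only LLM.
print(generate(lambda s: s[-1], ["hello"], n=3))   # ['hello', 'hello', 'hello', 'hello']
```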