Markov models are useful for understanding the limitations of any specific next-token predictor with a fixed context window, but not for the limitations of such predictors in general, since there is always a Markov model large enough to handle any specific input.
That's what n-grams are; even traditional token/word/character-based Markov chains don't rely on just the most recent word. Typical Markov chains in NLP are 3- to 7-grams.
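For concreteness, here's a minimal sketch of such an n-gram Markov chain (order 2, i.e. a trigram model); the toy corpus and starting state are just placeholders:

```python
import random
from collections import defaultdict

# Minimal trigram Markov chain over words: the "state" is the last two tokens,
# and the next token is sampled from the continuations seen in training.
def train_trigram(tokens):
    table = defaultdict(list)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        table[(a, b)].append(c)
    return table

def generate(table, state, n_words=20):
    out = list(state)
    for _ in range(n_words):
        choices = table.get(tuple(out[-2:]))
        if not choices:
            break                      # unseen state: nothing to sample from
        out.append(random.choice(choices))
    return " ".join(out)

corpus = "the cat sat on the mat and the cat ran off the mat".split()
model = train_trigram(corpus)
print(generate(model, ("the", "cat")))
```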
> Can you give an example of a non-Markov system?
Encoder-decoder LLMs violate the Markov Property and would not count as Markov Chains.
So, people can correctly call LLMs Markovian in practice, and also non-Markovian from a theoretical standpoint.
I think of it as conceptually a little like the distinction between a formal Turing machine, which requires an infinite tape, and a practical computer with a finite amount of memory. Your computer acts as a Turing machine for the real computations you use it for, but there exist computations that would require more memory than you have. From a theoretical standpoint, your computer is merely a finite state automaton.
Essentially yes. Given complex enough state, more or less any process is Markovian. Some require infinite state, which would maybe be stretching the definition a bit. In practice, the Markov formulation may not be a very good analysis perspective if the required state would be very large or complex.
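A rough sketch of the state-augmentation trick (the process and names here are made up for illustration): a process whose next value depends on the last three observations looks non-Markovian over single observations, but is Markovian once the "state" is defined as the whole three-observation window.

```python
import random
from collections import deque

def step(window):
    # Next value depends on the sum of the last three observations --
    # "long" memory from the point of view of a single observation.
    return (sum(window) + random.choice([0, 1])) % 5

window = deque([0, 1, 2], maxlen=3)   # augmented state: the last 3 observations
for _ in range(10):
    x = step(window)                  # transition depends only on the current state
    window.append(x)                  # new state is again just the last 3 observations
    print(x, tuple(window))
```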
What I don't understand is how the line is drawn between "past" and "present". Why are variables a,b,c viewed as "present" and not "past"? How is past defined? Why can't you view token generation as f(past1, past2, past3, present)?
As I said, it does look "Markovian" if the number of "pasts" is fixed, but again, what exactly is not Markovian then? Infinite inputs? Everything in the real world is finite.
Uh... You're moving the goalposts. I don't know the exact definition of "LLM", but let's suppose that it and the definition of "Markov chain" make us conclude that LLMs are necessarily Markov chains. "It's a Markov chain" is still a useful statement, because there are processes that are neither LLMs nor Markov chains.
I'm not sure what those would be. All dynamical systems (essentially all of classical physics), except maybe quantum mechanics, are Markov chains.
This is the limitation I was getting at, btw. In the example I was describing, if you have an image with solid vertical columns, followed by columns of random static, followed again by solid vertical colors, a Markov chain could eventually learn all patterns that go
solid->32 random bits->different solid color
And eventually it would start predicting the different color correctly based on the solid color before the randomness. It ‘just’ needs a state for every possible random color in between. This is ridiculous in practice, however, since you’d need to learn 2^32 states just for the relationship between those two solid colors alone.
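To make the scale concrete, here is the back-of-the-envelope count for this example, under the same framing (a plain table-based Markov chain that must condition on the entire block of random bits between the two solid colors):

```python
# One table entry per possible noise pattern between the two solid colors.
random_bits = 32
contexts_per_colour_pair = 2 ** random_bits
print(f"{contexts_per_colour_pair:,}")   # 4,294,967,296 states, per colour pair
```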
You can use skipgrams - prefixes with holes in them.
The Sparse Non-negative Matrix Language Model (SNMLM) [1] uses them with great success.
[1] https://aclanthology.org/Q16-1024/
Pure n-gram language models would have a hard time computing escape weights for such contexts, but the mixture of probabilities used in SNMLM does not need to do that.
If I may, I've implemented an online per-byte version of SNMLM [2], which allows the use of skipgrams. They make performance worse, but they can be used. My implementation's predictive performance is within a few percent of an LSTM's on enwik8.
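For readers unfamiliar with the term, here's a rough sketch of what "prefixes with holes" means; this is illustrative only, not the actual SNMLM feature set:

```python
from itertools import combinations

# Skipgram contexts: from the last k tokens, also form contexts where some
# positions are masked out, so a prediction can skip over unpredictable tokens.
def skipgram_contexts(prefix, max_holes=1, hole="_"):
    k = len(prefix)
    contexts = {tuple(prefix)}                      # the plain n-gram prefix
    for n_holes in range(1, max_holes + 1):
        for holes in combinations(range(k), n_holes):
            ctx = tuple(hole if i in holes else t for i, t in enumerate(prefix))
            contexts.add(ctx)
    return contexts

print(skipgram_contexts(["red", "noise", "blue"], max_holes=1))
# includes e.g. ('red', '_', 'blue'), which skips the unpredictable middle token
```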
Chat interfaces are like taking a simple Markov generator, but with a rule where you say 'whenever it reaches a state ending in X, hand over decision making about state transitions to a human instead of a random number generator, until they move it to a state ending in Y'.
Tool calling is similar - 'when it reaches state ending in X, send the state data to a tool; use the result to drive a series of state transitions; then start generating again'.
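A sketch of that control flow; `sample_next`, `run_tool`, `read_user_input` and the sentinel tokens are hypothetical stand-ins, not any real API. The point is only that certain states hand the next transitions to a human or a tool:

```python
# State is the token sequence so far; some tokens transfer control.
def chat_loop(sample_next, run_tool, read_user_input, state, max_steps=100):
    for _ in range(max_steps):
        token = sample_next(state)        # ordinary stochastic transition
        state = state + [token]
        if token == "END_TURN":
            # hand state transitions to a human until they yield control back
            state = state + read_user_input() + ["END_USER_TURN"]
        elif token == "TOOL_CALL":
            # send state data to a tool, splice its result into the state,
            # then resume ordinary generation
            state = state + run_tool(state) + ["TOOL_RESULT_END"]
    return state
```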
The validation of what I'm saying is little more than a few minutes of googling away.
Some interventions can just be warnings, rather than lectures.
The key insight you raise about attention is particularly important: it doesn't violate the Markov property, it just enables a vastly more sophisticated state representation. Classical n-gram Markov chains use simple discrete states, while transformers use high-dimensional continuous representations that can encode exponentially more information about context.
This perspective helps bridge the conceptual gap many people have. When they think "Markov chain," they often picture simple state transitions, but mathematically, LLMs are just Markov chains with astronomically complex states. The attention mechanism is the computational trick that makes these complex states tractable - it's not changing the fundamental probabilistic structure.
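As a toy illustration of what "a vastly more sophisticated state representation" means, here is a single-head causal self-attention step in numpy. The weights and sizes are random placeholders, not a real model; the point is that the output for the last position is a continuous summary of the entire context, rather than a discrete n-gram state:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))                       # token embeddings for the context
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)
mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # causal mask: no peeking ahead
scores[mask] = -np.inf
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
state = weights @ V                               # one continuous vector per position
print(state[-1])                                  # "state" seen at the last position
```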
As far as I understand it, as you have a back-and-forth conversation with an LLM, you have to provide the entire history of the conversation plus your new response each time.
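That's roughly what the client-side loop looks like when the completion endpoint is stateless; `complete` here is a hypothetical stand-in for whatever completion call an API actually exposes:

```python
# Each turn resends the whole conversation; the model only ever sees this list.
def chat_turn(complete, history, user_message):
    history = history + [{"role": "user", "content": user_message}]
    reply = complete(history)
    return history + [{"role": "assistant", "content": reply}]

history = []
history = chat_turn(lambda msgs: "Hi! How can I help?", history, "Hello")
history = chat_turn(lambda msgs: f"(model saw {len(msgs)} messages)", history, "What did I just say?")
print(history[-1]["content"])        # (model saw 3 messages)
```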
...when they are used without RAG and tools. x_n belongs to a set of cardinality 65536^100000 or so. Equating an LLM to a Markov chain doesn't let you make any non-trivial predictions about its behavior.
The essential property of a Markov chain is maintaining the Markov property:
P(X_n+1 = x_n+1 | X_n = x_n, X_n-1 = x_n-1, ..., X_1 = x_1) = P(X_n+1 = x_n+1 | X_n = x_n)
That is, the future, given the present state, is conditionally independent of the past states.
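You can see the property empirically on a toy chain: given a fixed transition matrix, the conditional distribution over the next state given the current one matches the distribution given the current and previous states, up to sampling noise.

```python
import random
from collections import Counter

P = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.5, "B": 0.5}}
random.seed(0)
xs = ["A"]
for _ in range(200_000):
    cur = xs[-1]
    xs.append(random.choices(list(P[cur]), weights=P[cur].values())[0])

pairs = Counter(zip(xs, xs[1:]))
triples = Counter(zip(xs, xs[1:], xs[2:]))
# P(next=A | current=A) vs P(next=A | current=A, previous=B)
p_given_cur = pairs[("A", "A")] / (pairs[("A", "A")] + pairs[("A", "B")])
p_given_cur_prev = triples[("B", "A", "A")] / (triples[("B", "A", "A")] + triples[("B", "A", "B")])
print(round(p_given_cur, 3), round(p_given_cur_prev, 3))  # both ≈ 0.9
```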
It's worth recognizing that decoder-only LLMs (which include most of the major LLMs people use) maintain the Markov property and can properly be understood as Markov chains.
> The attention mechanism is the best we have for this and Markov chains struggle beyond stringing together a few syllables.
Attention has nothing to do with maintaining the Markov property, but it allows for a fantastically more complex representation of state, which is where decoder-only LLMs derive the majority of their power.
tl;dr most of the LLMs people use are effectively Markov Chains.
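A sketch of that view: treat the (bounded) context window itself as the Markov state; the transition appends the sampled token and truncates to the window size, so the next state depends only on the current state. `sample_token` stands in for the model's next-token distribution and is not any real API:

```python
def markov_step(sample_token, state, window=4096):
    token = sample_token(state)               # P(next | current state) only
    return (state + [token])[-window:]        # new state

def generate(sample_token, prompt_tokens, n=20, window=4096):
    state = prompt_tokens[-window:]
    for _ in range(n):
        state = markov_step(sample_token, state, window)
    return state

# Toy "model" that echoes the last token; stands in for a real decoder-only LLM.
print(generate(lambda s: s[-1], ["hello"], n=3))   # ['hello', 'hello', 'hello', 'hello']
```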