It's a bit more subtle though: if I understand correctly, this only works for parallelizable problems, which makes intuitive sense since the model cannot pass information along with each dot. In that sense CoT can be seen as a form of sampling, which also tracks with findings that CoT doesn't boost "raw intelligence" but rather uncovers latent intelligence, roughly converting pass@k into maj@k. Antirez touches on this in [1].
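To make the pass@k vs maj@k distinction concrete, here's a tiny sketch (the function names and toy data are mine, purely illustrative):

    from collections import Counter

    def pass_at_k(samples, correct):
        # pass@k: at least one of the k sampled answers is correct
        return any(s == correct for s in samples)

    def maj_at_k(samples, correct):
        # maj@k: the most frequent sampled answer is correct (self-consistency)
        most_common, _ = Counter(samples).most_common(1)[0]
        return most_common == correct

    samples = [42, 17, 42, 42, 9]   # 5 samples for one question; truth is 42
    print(pass_at_k(samples, 42))   # True: the capability is latent in the samples
    print(maj_at_k(samples, 42))    # True: voting surfaces it without new "raw intelligence"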
On the other hand, I think problems with serial dependencies require "real" CoT, since the model needs to track the results of subproblems. There are also studies showing a meta-structure to the CoT itself, though: e.g. if you look at DeepSeek there are clear patterns of backtracking and the like that are somewhat more advanced than naive repeated sampling. https://arxiv.org/abs/2506.19143
It's not very efficient though, because for token i, layer N can only take as input the layer N-1 activations of tokens i, i-1, i-2, ..., so information is passed along diagonally. If, handwavily, the embedding represents some "partial result", then it can be handed off diagonally from (layer N-1, token i-1) to (layer N, token i), so that the CoT for token i+1 can continue to work on it. This way, even though the total circuit depth through the continuous activations is still bounded by the number of layers, it's clearly "more powerful" than naively going from layer 1 to N once, because during the other steps you can work on something else.
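A toy sketch of that dependency structure, just to make the diagonal explicit (the function is mine, purely illustrative):

    def inputs_to(layer, pos):
        # In a causal transformer, the activation at (layer, pos) is computed from
        # the layer-1 activations at positions 0..pos, so the only way information
        # climbs layers while moving across tokens is along diagonals like
        # (layer-1, pos-1) -> (layer, pos).
        return [(layer - 1, p) for p in range(pos + 1)]

    print(inputs_to(3, 2))   # [(2, 0), (2, 1), (2, 2)]
    # A "partial result" at (2, 1) can reach (3, 2), then (4, 3), ..., but any such
    # continuous path climbs at most num_layers steps before running out of depth.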
But it's still not as powerful as allowing the results at layer N to be fed back in as input to the next step, which effectively unrolls the depth. That maybe intuitively justifies the results in the paper (I think it also has some connection to communication complexity).
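To make the contrast concrete, here's a minimal single-vector toy (all names, shapes and the tanh "layers" are mine, and attention is dropped entirely): standard CoT squeezes the state through a discrete token between steps, whereas feeding the top-layer activation straight back in behaves like a single network of depth steps * L.

    import numpy as np

    rng = np.random.default_rng(0)
    d, L, vocab = 8, 4, 16
    W = rng.normal(size=(L, d, d)) / np.sqrt(d)   # stand-in per-layer weights
    E = rng.normal(size=(vocab, d))               # stand-in token embeddings

    def block(h):
        # one pass through the L layers (depth L)
        for n in range(L):
            h = np.tanh(h @ W[n])
        return h

    def standard_cot(steps):
        # only a sampled token crosses between steps: the continuous state is
        # collapsed to one of `vocab` symbols, then re-embedded
        h = E[0]
        for _ in range(steps):
            token = int(np.argmax(E @ block(h)))
            h = E[token]
        return h

    def feedback(steps):
        # hypothetical feedback variant: the top-layer state is reinjected,
        # effectively unrolling to depth steps * L
        h = E[0]
        for _ in range(steps):
            h = block(h)
        return h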