The second approach, with RL, is based on immediate feedback and could make a model smarter than us. Just think of AlphaZero or AlphaTensor. But this requires running a wide search over possible solutions and a mechanism to rank or filter out the bad ones (code execution, running a simulation or a game, optimizing some metric).
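As a minimal sketch of that generate-and-filter loop (the candidate generator is stubbed with hard-coded strings here; in a real setup it would be a model sampling many solutions, and the verifier could be any of the mechanisms above):

```python
def generate_candidates(task: str) -> list[str]:
    # Stub: pretend the model proposed these solutions for "add two numbers".
    return [
        "def add(a, b): return a - b",   # wrong
        "def add(a, b): return a + b",   # correct
        "def add(a, b): return b + a",   # correct
    ]

def passes_tests(candidate_src: str, tests: list[tuple[tuple, int]]) -> bool:
    # The verifier: execute the candidate and check it against known cases.
    scope: dict = {}
    try:
        exec(candidate_src, scope)
        return all(scope["add"](*args) == expected for args, expected in tests)
    except Exception:
        return False

tests = [((1, 2), 3), ((0, 5), 5), ((-1, 1), 0)]
kept = [c for c in generate_candidates("add two numbers") if passes_tests(c, tests)]
print(f"{len(kept)} of 3 candidates survive the filter")  # survivors become training data
```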
So models need both past experience and new experience to advance. They can start from organic text, but later they need to generate their own training examples. The feedback they get will be on topic, covering both the human user's intent and the model's own mistakes. That's very valuable. Feedback learning is what could finally let LLMs graduate from mediocre results.
DeepMind says they are using both, with feedback learning dialed up.