Why not? Video is a much more tractable problem because you have much more information to go on.
When you want to stay faithful to the actual data, your options are limited: for a large part of the image a simple convolution is about as good as it gets, except at the edges. Basically the only problem we couldn't solve 20 years ago was excessive ringing (which is why softer scaling algorithms were preferred). You can put quite a lot of effort into getting cleaner edges, especially for thin lines, but for most content you can't expect much more sharpness than what the basic methods give you.
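To make the "a simple convolution is about as good as it gets" point concrete, here's a minimal sketch of separable Lanczos resampling in NumPy (a = 3, pixel-center convention). The function names and parameters are just illustrative, not from any particular library; the negative lobes of the kernel are also where the ringing near sharp edges comes from:

```python
# Minimal sketch of "faithful" convolution-based upscaling: separable Lanczos-3.
import numpy as np

def lanczos_kernel(x, a=3):
    x = np.asarray(x, dtype=np.float64)
    out = np.sinc(x) * np.sinc(x / a)   # negative lobes -> ringing near edges
    out[np.abs(x) >= a] = 0.0
    return out

def resample_1d(img, new_len, axis, a=3):
    old_len = img.shape[axis]
    scale = old_len / new_len
    # Map each output sample center back to input coordinates.
    coords = (np.arange(new_len) + 0.5) * scale - 0.5
    out_shape = list(img.shape)
    out_shape[axis] = new_len
    result = np.zeros(out_shape, dtype=np.float64)
    img = np.moveaxis(img, axis, 0)
    result = np.moveaxis(result, axis, 0)
    for i, c in enumerate(coords):
        left = int(np.floor(c)) - a + 1
        idx = np.arange(left, left + 2 * a)
        w = lanczos_kernel(idx - c, a)
        w /= w.sum()                         # keep flat areas flat
        idx = np.clip(idx, 0, old_len - 1)   # clamp at the borders
        result[i] = np.tensordot(w, img[idx], axes=(0, 0))
    return np.moveaxis(result, 0, axis)

def lanczos_upscale(img, factor=2, a=3):
    """Upscale an H x W (x C) image by an integer factor, one axis at a time."""
    out = resample_1d(img, img.shape[0] * factor, axis=0, a=a)
    out = resample_1d(out, img.shape[1] * factor, axis=1, a=a)
    return out
```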
And then there is the generative approach, where you just make stuff up. It's quite effective but a bit iffy: fine for entertainment, but it's debatable whether the result is actually a true rescale of the image (and if you average over the distribution of possible images, the result is again too soft).
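For the "averaging makes it soft again" point, the usual way to see it (my framing, not the parent's): the estimate that minimizes expected squared error given the low-res input is the posterior mean over all plausible high-res images, and averaging candidates that disagree on fine detail cancels that detail out.

```latex
% MSE-optimal estimate given the low-res observation y:
\hat{x}_{\mathrm{MMSE}}(y)
  = \arg\min_{\hat{x}} \; \mathbb{E}\!\left[\,\|x - \hat{x}\|^2 \mid y\,\right]
  = \mathbb{E}[\,x \mid y\,]
  = \int x \, p(x \mid y)\, dx
```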
In theory video can do better by merging several frames of the same content.
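Roughly, each frame samples the scene at slightly different sub-pixel positions, so after registration you can scatter the samples onto a finer grid and merge them. A toy shift-and-add sketch (assuming the sub-pixel shifts are already known; everything here, names included, is illustrative):

```python
# Toy multi-frame super-resolution by shift-and-add onto a finer grid.
import numpy as np

def shift_and_add_sr(frames, shifts, factor=2):
    """frames: list of H x W arrays; shifts: (dy, dx) sub-pixel offsets of each
    frame relative to the first, in low-res pixels (assumed known here)."""
    h, w = frames[0].shape
    acc = np.zeros((h * factor, w * factor))
    weight = np.zeros_like(acc)
    ys, xs = np.mgrid[0:h, 0:w]
    for frame, (dy, dx) in zip(frames, shifts):
        # Position of each low-res sample on the high-res grid.
        hy = np.rint((ys + dy) * factor).astype(int)
        hx = np.rint((xs + dx) * factor).astype(int)
        ok = (hy >= 0) & (hy < h * factor) & (hx >= 0) & (hx < w * factor)
        np.add.at(acc, (hy[ok], hx[ok]), frame[ok])
        np.add.at(weight, (hy[ok], hx[ok]), 1.0)
    filled = weight > 0
    acc[filled] /= weight[filled]
    return acc  # holes remain where no frame landed

```

Real multi-frame pipelines estimate the shifts, handle motion and occlusion, and regularize the holes, but this is the core reason video gives you more information to work with.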
Note that a limitation of this result is that it assumes a static scene, but that's already a typical limitation of most Gaussian splatting applications anyway, so it arguably doesn't matter much.
I'm trying to build an open-source NN camera pipeline (the objective is for it to run on smartphones, offline rather than in real time), and I'm still barely managing the demosaicing part... would you be open to discussing it with me?
Video, where the result needs to be temporally coherent and make sense in 3D, can't be the easier one.