ryao parent
I find their use of an on-GPU interpreter to be both a bit of an odd choice and interesting at the same time. Usually, you would not want to use an on-GPU interpreter for anything involving high performance. However, it sounds to me like there is not much room for improvement left under Amdahl's law since the instructions should call highly parallel functions that run orders of magnitude longer than the interpreter does to in order to make the function calls This in itself is interesting, although I still wonder how much room for improvement there would be if they dropped the interpreter.
As the interpreter is core to the approach, I'm not entirely sure what's left if you drop that.