kragen
With MESI cache coherence, maybe you could migrate the whole workspace for your subroutine into a line of your core's L1D cache, and make it perform like hardware registers while retaining the pleasantly parsimonious TMS9900 architectural semantics?
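To make the sizes concrete, here is a rough sketch in C of what that could look like, under some assumptions that are mine and not from any real design: a 64-byte cache line, a small `pool` of workspaces indexed by a 16-bit workspace pointer, and hypothetical `blwp`/`rtwp` helpers modeled loosely on the TMS9900 instructions of the same names. A workspace is 16 registers of 16 bits, so 32 bytes, which fits in one line with room to spare. Once the core holds that line in the Modified or Exclusive state, register reads and writes are L1 hits with no coherence traffic, though whether that gets close enough to a real register file is exactly the open question.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

enum { LINE = 64, NREGS = 16 };  /* LINE is an assumed cache-line size */

/* One workspace: 16 x 16-bit registers (32 bytes), padded and aligned so
   that it never straddles two cache lines. */
typedef struct {
    alignas(LINE) uint16_t r[NREGS];
} workspace_t;

static_assert(sizeof(workspace_t) == LINE, "one workspace per cache line");

/* Hypothetical backing store; wp below is an index into this pool. */
static workspace_t pool[4];

/* The only state held in the CPU proper. */
typedef struct {
    uint16_t wp, pc, st;
} cpu_t;

/* BLWP-like call: switch workspace, saving the old WP, PC and ST in the
   new workspace's R13, R14 and R15. */
static void blwp(cpu_t *c, uint16_t new_wp, uint16_t new_pc) {
    uint16_t *r = pool[new_wp].r;
    r[13] = c->wp;
    r[14] = c->pc;
    r[15] = c->st;
    c->wp = new_wp;
    c->pc = new_pc;
}

/* RTWP-like return: restore WP, PC and ST from R13..R15. */
static void rtwp(cpu_t *c) {
    const uint16_t *r = pool[c->wp].r;
    uint16_t old_wp = r[13];
    c->pc = r[14];
    c->st = r[15];
    c->wp = old_wp;
}
```

A call then touches exactly one new line (the callee's workspace) and the return touches nothing the core does not already own, which is the property you would be counting on.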