Comment by mananaysiempre

mananaysiempre May 14, 2022 parent

ETA: I see now I was answering the wrong question: you were asking about the comparison between C and LuaJIT, not heavier FFIs and C/LuaJIT.

Honestly I think of the difference (as discussed in Wellons’s post among others) not as a performance optimization but as an anti-stupidity optimization: regardless of the performance impact, it’s stupid that the standard ELF ABI forces us to jump through these hoops for every foreign call, and even stupider that plain inter- and even intra-compilation-unit calls can also be affected unless you take additional measures. Things are also being fixed on the C side with things such as -fvisibility=, -fno-semantic-interposition, -fno-plt, and new relocation types.

Can this be relevant to performance? Probably—aside from just doing more stuff, there are trickier-to-predict parts of the impact such as buffer pressure on the indirect branch predictor. Does it? Not sure. The theoretical possibility of interposition preventing inlining of publicly-accessible functions is probably much more important, at the very least I have seen it make a difference. But this falls outside the scope of FFI, strictly speaking, even if the cause is related.

---

I don’t have a readily available example, but in the LuaJIT case there are two considerations that I can mention:

- FFI is not just cheap but gets into the realm of a native call (perhaps an indirect one), so a well-adapted inner loop is not ruined even if it makes several FFI calls per iteration (it will still be slower, but this is fractions not multiples unless the loop did not allocate at all before the change). What this influences is perhaps not even the final performance but the shape of the API boundary: similarly to the impact of promise pipelining for RPC[1], you’re no longer forced into the “construct job, submit job” mindset and coarse-grained calls (think NumPy). Even calling libm functions through the FFI, while probably not very smart, isn’t an instant death sentence, so not as many things are forced to be reimplemented in the language as you’re used to.

- The JIT is wonderfully speedy and simple, but draws much of that speed and simplicity from the fact that it really only understands two shapes of control flow: straight-line code; and straight-line code leading into a loop with straight-line code in the body. Other control transfers aren’t banned as such, but are built on top of these, can only be optimized across to a limited extent, and can confuse the machinery that decides what to trace. This has the unpleasant corollary that builtins, which are normally implemented as baked-in bytecode, can’t usefully have loops in them. The solution uses something LuaJIT 2.1 calls trace stitching: the problematic builtins are implemented in normal C and are free to have arbitrarily complex control flow inside, but instead of outright aborting the trace due to an unJITtable builtin the compiler puts what is effectively an FFI call into it.

[1] https://capnproto.org/rpc.html

This item has no comments currently.