Comment by throw827474737

throw827474737 May 14, 2022 parent

So why isn't C the baseline (and zig and rust being pretty close to it quite expected), but both luajit and julia are significantly faster??

gallexme May 14, 2022

https://nullprogram.com/blog/2018/05/27/

eatonphil May 14, 2022

> For the C “FFI” he used standard dynamic linking, not dlopen(). This distinction is important, since it really makes a difference in the benchmark. There’s a potential argument about whether or not this is a fair comparison to an actual FFI, but, regardless, it’s still interesting to measure

jcelerier May 14, 2022

With clang, just compiling with -fno-plt gives me:

    jit: 1.003483 ns/call
    plt: 1.254158 ns/call
    ind: 1.254616 ns/call

GCC does not seem to support it though, even if it accepts the flag and gives me:

    jit: 1.003483 ns/call
    plt: 1.502089 ns/call
    ind: 1.254616 ns/call

(tried everything I could think of that would have a chance to make the PLT disappear:

    cc -fno-plt -Bsymbolic -fno-semantic-interposition -flto -std=c99 -Wall -Wextra -O3 -g3 -Wl,-z,relro,-z,now -o benchmark benchmark.c ./empty.so -ldl

without any change on GCC)

arinlen May 14, 2022

> There’s a potential argument about whether or not this is a fair comparison to an actual FFI, but, regardless, it’s still interesting to measure (...)

If there's interest in measuring dynamic linking then wouldn't there be an interest in measuring it on all languages that support dynamic linking?

qalmakka May 14, 2022

I'm always pretty surprised when I find out most people writing C or C++ have no idea that PLTs exist. They have a small but not negligible cost.

miohtama May 14, 2022

Is there anything akin FFI but with static linking for any foreign (non C) language?

tlb May 14, 2022

Calling WebAssembly from Javascript, sort of?

In the early Python 2 era there was an option to build an interpreter binary with statically linked C stubs, and it was noticeably faster and let you access Python data structures from C. I used it for robotics code for speed. It was inconvenient because you had to link in all the modules you needed.

fweimer May 14, 2022

For OpenJDK, there is JEP 178: https://openjdk.java.net/jeps/178 I haven't seen it used in practice.

Ocaml's C-implemented functions are linked statically. But like JNI, the C functions have special names and type signatures, so it is slightly different from, say, ctypes in Python.

CGO for Go is statically linked, too. Its overhead stems from significant differences between the Go and C world. The example uses dynamic linking, but it would not have to do that.

junon May 14, 2022

The question as it stands makes a few assumptions I don't think one can make, and as such is a bit tricky to answer cleanly, but I'll try.

Yes it's just called linking. The language needs to be aware of calling conventions and perhaps side effects and be prepared for no additional intrinsic support for higher level features.

It probably also needs to be able to read C headers, because C symbols do not contain type signatures like many C++ compilers add.

There's no "library" or some out of the box solution for this, if that's what you're asking. This boils down to how programs are constructed and, moreso, how CPUs work.

In most (all?) cases, anything higher level than straight-up linking is headed toward FFI territory.

samatman May 14, 2022

LuaJIT can use the FFI against statically linked object code just fine, I'm not sure if that answers your question since in this context it must be embedded in a C program.

It's a hard requirement of static linking that you have just one binary so it might, answer your question that is.

bachmeier May 14, 2022

C, C++, Zig, Rust, D, and Haskell are all similar because they're basically doing the same thing. Someone else linked to the blog post, but Lua and Julia aren't doing the same thing, so they get different results.

> both luajit and julia are significantly faster

I would be interested if anyone has an example where the difference matters in practice. As soon as you move to the more realistic scenario where you're writing a program that does something other than what is measured by these benchmarks, that's not going to be your biggest concern.

kllrnohj May 14, 2022

> I would be interested if anyone has an example where the difference matters in practice.

Vulkan. Any sort of binding to Vulkan over a non-trivial FFI (so like, not from C++, Rust, etc...) is going to be murdered by this FFI overhead cost. Especially since for bindings from something like Java you're either paying FFI on every field set on a struct, or you're paying non-trivial marshalling costs to convert from a Java class to a C struct to then finally call the corresponding Vulkan function.

joeld42 May 14, 2022

Not really, you're usually setting up commands and buffers and stuff in Vulkan. If you're making millions of calls a frame, you're going to have other bottlenecks.

My favorite example is something like Substance designer's node graph or Disney's SeExpr. You'd often want custom nodes that do often something trivial like a lookup from a custom data format or a small math evaluation, but you're calling the node potentially a handful of times per pixel, on millions of pixels. The calling overhead often comes out to take as much time or more than the operation, but there's no easy way to rearrange the operations without making things a lot more complicated for everyone.

I kind of like python's approach, make it so slow that it's easy to notice when you're hitting the bottleneck. Encourages you to write stuff that works in larger operations, and you get stuff like numpy and tensorflow which are some of the fastest things out there despite the slowest binding.

https://www.disneyanimation.com/technology/seexpr-expression...

kllrnohj May 14, 2022

> Not really, you're usually setting up commands and buffers and stuff in Vulkan

Those commands and buffers are represented as C structs. If you're in a language that can't speak C structs (like Java, Go, Dart, JavaScript, etc...), all of those command & buffer setup become function calls rather than simple field writes.

bachmeier May 14, 2022

> Especially since for bindings from something like Java

I guess I wasn't clear, but I meant the difference between C and Luajit.

kllrnohj May 14, 2022

Ah. The answer to that is a lot more murky, since in an actual C/C++ program you're going to have a mix of local, static, and dynamic linking. You're generally not putting super chatty stuff across a dynamic linkage, since that tends to be where the stable API boundaries go. Anything internal is then going to be static linkage, so comparable to luajit, or inlined (either by the compiler initially or with something like LTO) and then even faster than luajit

vvanders May 14, 2022

Oh it totally matters, any sort of chatty interface over FFI you will pay for it.

There's a reason a lot of gamedev uses luajit, I've personally had to refactor many interfaces to avoid JNI calls as much as possible as there was significant overhead(both in the call and from the VM not being able to optimize around it).

kllrnohj May 14, 2022

The reason a lot of gamedev uses luajit is the ease at which it can be embedded.

And that's not really even true anymore as the majority of gamedev is using Unreal or Unity, neither of which use luajit.

vvanders May 14, 2022

It's not just how easy it is to embed, it's also really small in both code + runtime size. I've shipped it on systems with sub 8mb of total system memory(we used a preallocated 400kb block), until quickjs came along there really wasn't anything comparable. It was also much faster than anything else at the time and regularly beat v8 in the benchmarks I ran.

Unity+Unreal are the public engines out there but there's plenty of in-house engines and tool chains you don't really hear about. I wouldn't be surprised if it's still deployed in quite a few contexts.

mananaysiempre May 14, 2022

ETA: I see now I was answering the wrong question: you were asking about the comparison between C and LuaJIT, not heavier FFIs and C/LuaJIT.

Honestly I think of the difference (as discussed in Wellons’s post among others) not as a performance optimization but as an anti-stupidity optimization: regardless of the performance impact, it’s stupid that the standard ELF ABI forces us to jump through these hoops for every foreign call, and even stupider that plain inter- and even intra-compilation-unit calls can also be affected unless you take additional measures. Things are also being fixed on the C side with things such as -fvisibility=, -fno-semantic-interposition, -fno-plt, and new relocation types.

Can this be relevant to performance? Probably—aside from just doing more stuff, there are trickier-to-predict parts of the impact such as buffer pressure on the indirect branch predictor. Does it? Not sure. The theoretical possibility of interposition preventing inlining of publicly-accessible functions is probably much more important, at the very least I have seen it make a difference. But this falls outside the scope of FFI, strictly speaking, even if the cause is related.

---

I don’t have a readily available example, but in the LuaJIT case there are two considerations that I can mention:

- FFI is not just cheap but gets into the realm of a native call (perhaps an indirect one), so a well-adapted inner loop is not ruined even if it makes several FFI calls per iteration (it will still be slower, but this is fractions not multiples unless the loop did not allocate at all before the change). What this influences is perhaps not even the final performance but the shape of the API boundary: similarly to the impact of promise pipelining for RPC[1], you’re no longer forced into the “construct job, submit job” mindset and coarse-grained calls (think NumPy). Even calling libm functions through the FFI, while probably not very smart, isn’t an instant death sentence, so not as many things are forced to be reimplemented in the language as you’re used to.

- The JIT is wonderfully speedy and simple, but draws much of that speed and simplicity from the fact that it really only understands two shapes of control flow: straight-line code; and straight-line code leading into a loop with straight-line code in the body. Other control transfers aren’t banned as such, but are built on top of these, can only be optimized across to a limited extent, and can confuse the machinery that decides what to trace. This has the unpleasant corollary that builtins, which are normally implemented as baked-in bytecode, can’t usefully have loops in them. The solution uses something LuaJIT 2.1 calls trace stitching: the problematic builtins are implemented in normal C and are free to have arbitrarily complex control flow inside, but instead of outright aborting the trace due to an unJITtable builtin the compiler puts what is effectively an FFI call into it.

[1] https://capnproto.org/rpc.html

This item has no comments currently.