edit: 30% improvement, still 100x slower than e.g. Rust.
cdef extern from "newplus/plus.h":
    cpdef int plusone(int x)

cdef extern from "newplus/plus.h":
    cpdef long long current_timestamp()

def run(int count):
    # long long matches the return type of current_timestamp()
    cdef long long start
    cdef long long out
    cdef int x = 0
    start = current_timestamp()
    # Typed loop: plusone() is called directly as a C function here
    while x < count:
        x = plusone(x)
    out = current_timestamp() - start
    return out
This actually yields 597, compared to 838 for the pure C program.
If you need a fast loop, do not use Python.
I am a Python hater, but this is unfair. Python is not designed to do fast loops. Crossing the FFI boundary happens very few times compared to iterations of tight loops.
(I have very little experience using FFI, but I am about to - hence keen interest)
Otherwise it will do an attribute lookup in each loop iteration. Python has no way to assume that function calls are free of side effects: lib.plusone could have been rebound to something new inside the plusone call itself.
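To make the lookup point concrete, here is a minimal sketch. Note that `lib` below is a hypothetical stand-in built with `SimpleNamespace`, not a real FFI handle, so the timings only show the attribute-lookup cost and say nothing about actual FFI overhead. The function names `run_with_lookup` and `run_hoisted` are made up for illustration.

```python
import timeit
from types import SimpleNamespace

# Hypothetical stand-in for an FFI library object; a real one would come
# from something like cffi or ctypes. Only the lookup pattern matters here.
lib = SimpleNamespace(plusone=lambda x: x + 1)

def run_with_lookup(count):
    x = 0
    while x < count:
        # Looks up the global `lib` and its attribute on every iteration.
        x = lib.plusone(x)
    return x

def run_hoisted(count):
    # Resolve the attribute once, before the loop.
    plusone = lib.plusone
    x = 0
    while x < count:
        x = plusone(x)
    return x

if __name__ == "__main__":
    n = 200_000
    print("per-iteration lookup:", timeit.timeit(lambda: run_with_lookup(n), number=5))
    print("hoisted to a local:  ", timeit.timeit(lambda: run_hoisted(n), number=5))
```

Both versions return the same result; the hoisted one just skips the repeated lookups. The catch is that it also stops noticing if `lib.plusone` gets rebound mid-loop, which is exactly why the interpreter can't do this for you.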
Doesn't feel like that would be the case from using NumPy, PyTorch and the like, but they also typically run 'fat' functions, where it's one function with a lot of data that returns something. You usually don't chain or loop much there.
Edit: the number was for 500 million calls. Yeah, don't think I've ever made that many calls. 123 seconds feels fairly short then, except for demanding workflows like game dev maybe.
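For scale, a quick back-of-the-envelope calculation, assuming the 123 seconds is the total wall time for all 500 million calls as stated above:

```python
# Figures taken from the comment above; assumes 123 s is the total for all calls.
calls = 500_000_000
total_seconds = 123

ns_per_call = total_seconds * 1e9 / calls
print(f"{ns_per_call:.0f} ns per call")  # prints "246 ns per call"
```

So that works out to roughly a quarter of a microsecond per call.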
https://github.com/dyu/ffi-overhead/pull/18