Scheme implementations targeting C (e.g. Chicken[1] and Gambit[2]) have literal decades of experience doing tail calls, but admittedly the resulting translations (trampolining, Cheney on the MTA, whatever Gambit does) do not make for very natural C. The C paper[3] you mention does make note of this.
The limits of Clang’s musttail stem from the limitations of calling conventions (which, regardless of what the standard says, allow for passing more arguments than the callee expects, and thus require caller cleanup). So far the solutions[4] to this seem rather awkward and definitely not backwards compatible. Are you aware of any convincing proposals in this direction?
[3] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2920.pdf
I've been able to get around this in language-to-C compilation by inserting trampoline macros. The first instance of the call deploys the trampoline function; further tail calls return to it with the next function pointer as a parameter.
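For anyone who hasn't seen the pattern, here is a minimal sketch of a trampoline in plain C. All the names (`struct step`, `TAIL_CALL`, `trampoline`, `is_even`) are made up for illustration and are not taken from any particular compiler's output; the point is only that each "tail call" returns the next function to the driver loop instead of calling it, so the C stack never grows.

```c
#include <stddef.h>

/* Hypothetical example state: the argument and the result slot. */
struct state { unsigned long n; int result; };

/* A step returns the next step to run, or a NULL fn when finished.
   Wrapping the pointer in a struct lets the type refer to itself. */
struct step { struct step (*fn)(struct state *); };

/* "Tail call": hand the next function back to the trampoline. */
#define TAIL_CALL(f) return (struct step){ (f) }
#define FINISH()     return (struct step){ NULL }

static struct step even_step(struct state *s);
static struct step odd_step(struct state *s);

static struct step even_step(struct state *s)
{
    if (s->n == 0) { s->result = 1; FINISH(); }
    s->n--;
    TAIL_CALL(odd_step);
}

static struct step odd_step(struct state *s)
{
    if (s->n == 0) { s->result = 0; FINISH(); }
    s->n--;
    TAIL_CALL(even_step);
}

/* The driver loop: keeps calling whatever step was returned last. */
static void trampoline(struct step next, struct state *s)
{
    while (next.fn != NULL)
        next = next.fn(s);
}

/* The first call deploys the trampoline; everything after bounces off it. */
int is_even(unsigned long n)
{
    struct state s = { n, 0 };
    trampoline((struct step){ even_step }, &s);
    return s.result;
}
```

Calling `is_even(1000000)` runs a million mutually recursive "calls" in constant stack space, at the cost of an indirect call per step and of threading all arguments through the state struct.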
wasm3 uses tail calls to implement its interpreter bytecode handlers and it manages to successfully force tail-call optimization in both GCC and LLVM. Worth having a look at how it does that?
It's possible, but for GCC it requires optimizations to be enabled. MSVC is completely off-limits in that approach.
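To make the shape of that approach concrete, here is a rough sketch of tail-call-dispatched bytecode handlers. This is not wasm3's actual code; the types, the `NEXT` macro and the opcodes are all hypothetical. Every handler ends by calling the next instruction's handler in tail position with an identical signature, which is the case optimizing compilers are most willing to turn into a jump.

```c
#include <stddef.h>

/* Hypothetical VM state: a single accumulator register. */
struct vm { long acc; };

struct insn;
typedef long (*handler)(const struct insn *pc, struct vm *vm);

/* Each instruction carries a pointer to its own handler. */
struct insn { handler op; long arg; };

/* Dispatch: call the next handler in tail position. Whether this becomes
   a jump depends on the compiler and the optimization level. */
#define NEXT(pc, vm) return (pc)[1].op((pc) + 1, (vm))

static long op_add(const struct insn *pc, struct vm *vm)
{
    vm->acc += pc->arg;
    NEXT(pc, vm);
}

static long op_mul(const struct insn *pc, struct vm *vm)
{
    vm->acc *= pc->arg;
    NEXT(pc, vm);
}

static long op_halt(const struct insn *pc, struct vm *vm)
{
    (void)pc;               /* nothing left to dispatch to */
    return vm->acc;
}

/* Runs the tiny program (0 + 2) * 21 and returns the accumulator. */
long run_demo(void)
{
    const struct insn program[] = {
        { op_add, 2 }, { op_mul, 21 }, { op_halt, 0 },
    };
    struct vm vm = { 0 };
    return program[0].op(program, &vm);
}
```

Without optimization each `NEXT` is an ordinary call, so a long-running program written this way would eventually exhaust the stack; that is the GCC caveat mentioned above.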
The C committee voted 8-7-8 against including mandatory tail calls in C23 (https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2941.pdf), mostly I think because of a lack of implementation experience.
LLVM has everything needed via the musttail marker, but the musttail attribute in clang is a lot more restrictive (it enforces that the caller and callee function signatures are the same).
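As a small illustration of the Clang side, here is a sketch using the `musttail` statement attribute. The `MUSTTAIL` macro and the function names are my own; the macro falls back to an ordinary `return` on compilers that don't report the attribute via `__has_attribute`, in which case nothing is guaranteed. Both functions deliberately share one signature, since that is what the attribute insists on.

```c
/* Use the attribute only where the compiler reports support for it. */
#if defined(__has_attribute)
#  if __has_attribute(musttail)
#    define MUSTTAIL __attribute__((musttail))
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL /* unsupported: plain call, no guarantee */
#endif

static unsigned long sum_b(unsigned long n, unsigned long acc);

/* Same parameter and return types as sum_b, as musttail requires. */
static unsigned long sum_a(unsigned long n, unsigned long acc)
{
    if (n == 0)
        return acc;
    MUSTTAIL return sum_b(n - 1, acc + n);
}

static unsigned long sum_b(unsigned long n, unsigned long acc)
{
    if (n == 0)
        return acc;
    MUSTTAIL return sum_a(n - 1, acc + n);
}

/* Sums 1..n by bouncing between the two functions. */
unsigned long sum_to(unsigned long n)
{
    return sum_a(n, 0);
}
```

If `sum_b` took an extra parameter, Clang would reject the marked call outright rather than quietly emitting a normal one, which is exactly the restriction being discussed.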
It might theoretically be possible to get more flexibility plumbed through to clang, but then I'm guessing getting GCC on board would be a huge job too. :-/