* Madhavan T. Venkataraman:

> Otherwise, using an ABI quirk or a calling convention side effect to
> load the PC into a GPR is, IMO, non-standard or non-compliant or
> non-approved or whatever you want to call it. I would be
> conservative and not use it. Who knows what incompatibility there
> will be with some future software or hardware features?

AArch64 PAC makes a backwards-incompatible change that touches this
area, but we'll see if they can actually get away with it.

In general, these things are baked into the ABI, even if they are not
spelled out explicitly in the psABI supplement.

> For instance, in the i386 example, we do a call without a matching return.
> Also, we use a pop to undo the call. Can anyone tell me if this kind of use
> is an ABI approved one?

Yes, for i386, this is completely valid from an ABI point of view.
It's equally possible to use a regular function call and just read the
return address that has been pushed to the stack.  Then there's no
stack mismatch at all.  Return stack predictors (including the one
used by SHSTK) also recognize the CALL 0 construct, so that's fine as
well.  The i386 psABI does not use function descriptors, and either
approach (out-of-line thunk or CALL 0) is in common use to materialize
the program counter in a register and construct the GOT pointer.

> If the kernel supplies this, then all applications and libraries can use
> it for all architectures with one single, simple API. Without this, each
> application/library has to roll its own solution for every architecture-ABI
> combo it wants to support.

Is there any other user for these type-generic trampolines?
Everything else I've seen generates machine code specific to the
function being called.  libffi is quite the outlier in my experience
because the trampoline calls a generic data-driven
marshaller/unmarshaller.  The other trampoline generators put this
marshalling code directly into the generated trampoline.

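
To make the i386 point above a bit more concrete, here is a minimal
sketch of the "regular function call, then read the return address"
approach, written in C so that it is not tied to one architecture.  It
assumes GCC or Clang (for __builtin_return_address and the noinline
attribute); the name current_pc is made up for illustration.  The
comment shows the two i386 assembly shapes that compilers commonly
emit for the same purpose.

```c
#include <stdint.h>

/*
 * For comparison, the two common i386 idioms look roughly like this:
 *
 *   CALL 0 variant:            out-of-line thunk variant:
 *     call 1f                    call __x86.get_pc_thunk.bx
 *   1: popl %ebx                 ...
 *                              __x86.get_pc_thunk.bx:
 *                                movl (%esp), %ebx
 *                                ret
 *
 * The function below is the portable analogue of the thunk: the CALL
 * that reaches it pushes (or otherwise records) the address of the
 * instruction following the call, and we simply read that value back.
 * Call and return are balanced, so there is no stack mismatch.
 */
__attribute__((noinline)) static uintptr_t
current_pc (void)
{
  /* Level 0 is the return address of this very function, that is, an
     address inside the caller.  noinline matters: an inlined copy
     would not have a call of its own. */
  return (uintptr_t) __builtin_return_address (0);
}
```

A caller would add a link-time constant offset to the returned value
to reach nearby data, which is what the GOT pointer setup does on
i386.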
I'm still not convinced that this can't be done directly in libffi,
without kernel help.  Hiding the architecture-specific code in the
kernel doesn't reduce overall system complexity.

> As an example, in libffi:
>
> 	ffi_closure_alloc() would call alloc_tramp()
>
> 	ffi_prep_closure_loc() would call init_tramp()
>
> 	ffi_closure_free() would call free_tramp()
>
> That is it! It works on all the architectures supported in the kernel for
> trampfd.

ffi_prep_closure_loc would still need to check whether the trampoline
has been allocated by alloc_tramp because some applications supply
their own (executable and writable) mapping.  ffi_closure_alloc would
need to support different sizes (not matching the trampoline).  It's
also unclear to me to what extent software out there writes to the
trampoline data directly, bypassing the libffi API (the structs are
not opaque, after all).  And all the existing libffi memory management
code (including the embedded dlmalloc copy) would be needed to support
kernels without trampfd for years to come.

I very much agree that we have a gap in libffi when it comes to
JIT-less operation.  But I'm not convinced that kernel support is
needed to close it, or that it is even the right design.
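
As a rough illustration of the ffi_prep_closure_loc point above: the
prepare step cannot assume that the closure came from alloc_tramp, so
it needs some record of which closures did.  The sketch below is
purely hypothetical.  None of these names (track_closure, is_tracked,
untrack_closure, choose_prep_path, the fixed-size table) exist in
libffi or in the trampfd proposal; they only show the kind of
bookkeeping and dual code path that would have to stay around.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical registry of closures handed out by the kernel-backed
   allocator.  A real implementation would need locking and a data
   structure without a fixed size limit. */
#define MAX_TRACKED 64
static const void *tracked[MAX_TRACKED];
static size_t tracked_count;

/* Returns true if the closure was registered by the allocator. */
static bool
is_tracked (const void *closure)
{
  for (size_t i = 0; i < tracked_count; i++)
    if (tracked[i] == closure)
      return true;
  return false;
}

/* Would be called from the allocation path.  Returns false when the
   table is full. */
static bool
track_closure (const void *closure)
{
  if (is_tracked (closure))
    return true;
  if (tracked_count == MAX_TRACKED)
    return false;
  tracked[tracked_count++] = closure;
  return true;
}

/* Would be called from the free path.  Unknown pointers are ignored. */
static void
untrack_closure (const void *closure)
{
  for (size_t i = 0; i < tracked_count; i++)
    if (tracked[i] == closure)
      {
        /* Move the last entry into the freed slot. */
        tracked[i] = tracked[--tracked_count];
        return;
      }
}

enum prep_path
{
  PREP_KERNEL_TRAMP,  /* closure came from the kernel-backed allocator */
  PREP_LEGACY_WRITE   /* application-supplied mapping: write code as today */
};

/* The decision the prepare step has to make for every closure. */
static enum prep_path
choose_prep_path (const void *closure)
{
  return is_tracked (closure) ? PREP_KERNEL_TRAMP : PREP_LEGACY_WRITE;
}
```

The legacy branch is the part that does not go away, which is why I
don't see the kernel interface simplifying libffi much.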