On Tue, Mar 22, 2022 at 11:04:38AM -0400, Steven Rostedt wrote:

> > In recap:
> >
> >   __fentry__ -- push on trace-stack
> >   __ftail__  -- mark top-most entry complete
> >   __fexit__  -- mark top-most entry complete;
> >                 pop all completed entries
>
> Again, this would require that the tail-calls are also being traced.

Which is why we should inhibit tail-calls if the function is notrace.

> > inhibit tail-calls to notrace.
>
> Just inhibiting tail-calls to notrace would work without any of the above.

I'm lost again; what? Without any of the above you got nothing, because
return trampolines will not work.

> But my fear is that will cause a noticeable performance impact.

Most code isn't in fact notrace, and call+ret aren't *that* expensive.

> > It's function graph tracing, kretprobes and whatever else this rethook
> > stuff is about that needs this, because return trampolines will stop
> > working somewhere in the not too distant future.
>
> Another crazy solution is to have:
>
> func_A:
> 	call	__fentry__
> 	...
> tail:	jmp	1f
> 	call	1f
> 	call	__fexit__
> 	ret
> 1:	jmp	func_B
>
> where the compiler tells us about "tail:" and that we know that func_A
> ends with a tail call. If we want to trace the end of func_A, we convert
> that "jmp 1f" into a nop; then we call func_B, and its return comes back
> to where we call __fexit__, after which we return normally.

At that point, giving us something like:

	1:	pushsection __ftail_loc
		.long 1b - .
		popsection

		jmp.d32	func_B
		call	__fexit__
		ret

is smaller and simpler; we can patch the jmp.d32 into a call when tracing.
The only problem is SLS, which might want an int3 after the jmp too
(https://www.amd.com/en/corporate/product-security/bulletin/amd-sb-1026).

That does avoid the need for __ftail__, I suppose.