On Wed, Apr 24, 2024 at 5:02 PM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote: > > At the lowest level, rethook-based kretprobes on x86-64 architecture go > through arch_rethoook_trampoline() function, manually written in > assembly, which calls into a simple arch_rethook_trampoline_callback() > function, written in C, and only doing a few straightforward field > assignments, before calling further into rethook_trampoline_handler(), > which handles kretprobe callbacks generically. > > Looking at simplicity of arch_rethook_trampoline_callback(), it seems > not really worthwhile to spend an extra function call just to do 4 or > 5 assignments. As such, this patch proposes to "inline" > arch_rethook_trampoline_callback() into arch_rethook_trampoline() by > manually implementing it in an assembly code. > > This has two motivations. First, we do get a bit of runtime speed up by > avoiding function calls. Using BPF selftests's bench tool, we see > 0.6%-0.8% throughput improvement for kretprobe/multi-kretprobe > triggering code path: > > BEFORE (latest probes/for-next) > =============================== > kretprobe : 10.455 ± 0.024M/s > kretprobe-multi: 11.150 ± 0.012M/s > > AFTER (probes/for-next + this patch) > ==================================== > kretprobe : 10.540 ± 0.009M/s (+0.8%) > kretprobe-multi: 11.219 ± 0.042M/s (+0.6%) > > Second, and no less importantly for some specialized use cases, this > avoids unnecessarily "polluting" LBR records with an extra function call > (recorded as a jump by CPU). This is the case for the retsnoop ([0]) > tool, which relies havily on capturing LBR records to provide users with > lots of insight into kernel internals. > > This RFC patch is only inlining this function for x86-64, but it's > possible to do that for 32-bit x86 arch as well and then remove > arch_rethook_trampoline_callback() implementation altogether. Please let > me know if this change is acceptable and whether I should complete it > with 32-bit "inlining" as well. Thanks! > > [0] https://nakryiko.com/posts/retsnoop-intro/#peering-deep-into-functions-with-lbr > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx> > --- > arch/x86/kernel/asm-offsets_64.c | 4 ++++ > arch/x86/kernel/rethook.c | 37 +++++++++++++++++++++++++++----- > 2 files changed, 36 insertions(+), 5 deletions(-) > > diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c > index bb65371ea9df..5c444abc540c 100644 > --- a/arch/x86/kernel/asm-offsets_64.c > +++ b/arch/x86/kernel/asm-offsets_64.c > @@ -42,6 +42,10 @@ int main(void) > ENTRY(r14); > ENTRY(r15); > ENTRY(flags); > + ENTRY(ip); > + ENTRY(cs); > + ENTRY(ss); > + ENTRY(orig_ax); > BLANK(); > #undef ENTRY > > diff --git a/arch/x86/kernel/rethook.c b/arch/x86/kernel/rethook.c > index 8a1c0111ae79..3e1c01beebd1 100644 > --- a/arch/x86/kernel/rethook.c > +++ b/arch/x86/kernel/rethook.c > @@ -6,6 +6,7 @@ > #include <linux/rethook.h> > #include <linux/kprobes.h> > #include <linux/objtool.h> > +#include <asm/asm-offsets.h> > > #include "kprobes/common.h" > > @@ -34,10 +35,36 @@ asm( > " pushq %rsp\n" > " pushfq\n" > SAVE_REGS_STRING > - " movq %rsp, %rdi\n" > - " call arch_rethook_trampoline_callback\n" > + " movq %rsp, %rdi\n" /* $rdi points to regs */ > + /* fixup registers */ > + /* regs->cs = __KERNEL_CS; */ > + " movq $" __stringify(__KERNEL_CS) ", " __stringify(pt_regs_cs) "(%rdi)\n" > + /* regs->ip = (unsigned long)&arch_rethook_trampoline; */ > + " movq $arch_rethook_trampoline, " __stringify(pt_regs_ip) "(%rdi)\n" > + /* regs->orig_ax = ~0UL; */ > + " movq $0xffffffffffffffff, " __stringify(pt_regs_orig_ax) "(%rdi)\n" > + /* regs->sp += 2*sizeof(long); */ > + " addq $16, " __stringify(pt_regs_sp) "(%rdi)\n" > + /* 2nd arg is frame_pointer = (long *)(regs + 1); */ > + " lea " __stringify(PTREGS_SIZE) "(%rdi), %rsi\n" BTW, all this __stringify() ugliness can be avoided if we move this assembly into its own .S file, like lots of other assembly functions in arch/x86/kernel subdir. That has another benefit of generating better line information in DWARF for those assembly instructions. It's lots more work, so before I do this, I'd like to get confirmation that this change is acceptable in principle. > + /* > + * The return address at 'frame_pointer' is recovered by the > + * arch_rethook_fixup_return() which called from this > + * rethook_trampoline_handler(). > + */ > + " call rethook_trampoline_handler\n" > + /* > + * Copy FLAGS to 'pt_regs::ss' so we can do RET right after POPF. > + * > + * We don't save/restore %rax below, because we ignore > + * rethook_trampoline_handler result. > + * > + * *(unsigned long *)®s->ss = regs->flags; > + */ > + " mov " __stringify(pt_regs_flags) "(%rsp), %rax\n" > + " mov %rax, " __stringify(pt_regs_ss) "(%rsp)\n" > RESTORE_REGS_STRING > - /* In the callback function, 'regs->flags' is copied to 'regs->ss'. */ > + /* We just copied 'regs->flags' into 'regs->ss'. */ > " addq $16, %rsp\n" > " popfq\n" > #else > @@ -61,6 +88,7 @@ asm( > ); > NOKPROBE_SYMBOL(arch_rethook_trampoline); > > +#ifdef CONFIG_X86_32 > /* > * Called from arch_rethook_trampoline > */ > @@ -70,9 +98,7 @@ __used __visible void arch_rethook_trampoline_callback(struct pt_regs *regs) > > /* fixup registers */ > regs->cs = __KERNEL_CS; > -#ifdef CONFIG_X86_32 > regs->gs = 0; > -#endif > regs->ip = (unsigned long)&arch_rethook_trampoline; > regs->orig_ax = ~0UL; > regs->sp += 2*sizeof(long); > @@ -92,6 +118,7 @@ __used __visible void arch_rethook_trampoline_callback(struct pt_regs *regs) > *(unsigned long *)®s->ss = regs->flags; > } > NOKPROBE_SYMBOL(arch_rethook_trampoline_callback); > +#endif > > /* > * arch_rethook_trampoline() skips updating frame pointer. The frame pointer > -- > 2.43.0 >