On Fri, Feb 28, 2025 at 2:55 PM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
>
> On Fri, Feb 28, 2025 at 10:55:24AM -0800, Andrii Nakryiko wrote:
> > On Mon, Feb 24, 2025 at 6:04 AM Jiri Olsa <jolsa@xxxxxxxxxx> wrote:
> > >
> > > Putting together all the previously added pieces to support optimized
> > > uprobes on top of a 5-byte nop instruction.
> > >
> > > The current uprobe execution goes through the following steps:
> > >   - installs the breakpoint instruction over the original instruction
> > >   - the exception handler is hit and calls the related uprobe consumers
> > >   - either simulates the original instruction or does out-of-line
> > >     single-step execution of it
> > >   - returns to user space
> > >
> > > The optimized uprobe path:
> > >   - checks that the original instruction is a 5-byte nop (plus other checks)
> > >   - adds (or reuses an existing) user space trampoline and overwrites the
> > >     original instruction (5-byte nop) with a call to the user space
> > >     trampoline
> > >   - the user space trampoline executes the uprobe syscall that calls the
> > >     related uprobe consumers
> > >   - the trampoline returns back to the next instruction
> > >
> > > This approach won't speed up all uprobes, as it's limited to using nop5 as
> > > the original instruction, but we could use nop5 as the USDT probe
> > > instruction (which uses a single-byte nop ATM) and speed up the USDT probes.
> > >
> > > This patch overloads the related arch functions in uprobe_write_opcode and
> > > set_orig_insn so they can install the call instruction if needed.
> > >
> > > The arch_uprobe_optimize function triggers the uprobe optimization and is
> > > called after the first uprobe hit. I originally had it called on uprobe
> > > installation, but it clashed with the elf loader, because the user space
> > > trampoline was added in a place where the loader might need to put elf
> > > segments, so I decided to do it after the first uprobe hit, when loading
> > > is done.
> > >
> > > We do not unmap and release the uprobe trampoline when it's no longer
> > > needed, because there's no easy way to make sure none of the threads is
> > > still inside the trampoline. But we do not waste memory, because there's
> > > just a single page for all the uprobe trampoline mappings.
> > >
> > > We do waste a page frame for every 4GB by keeping the uprobe trampoline
> > > page mapped, but that seems ok.
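
(For reference, my reading of the probe-site rewrite described above, shown
with the canonical 5-byte nop; the actual nop5 encoding and the rel32 offset
to the trampoline page are up to the patch, so these bytes are illustrative:)

    before:  0f 1f 44 00 00      nopl 0x0(%rax,%rax,1)    /* 5-byte nop */
    after:   e8 xx xx xx xx      call <uprobe trampoline>  /* rel32 call */

The rel32 displacement reaches at most +/-2GB, which is presumably why one
trampoline page per 4GB range is needed.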
> > >
> > > Attaching the speedup from the benchs/run_bench_uprobes.sh script:
> > >
> > > current:
> > >     usermode-count   :  818.836 ± 2.842M/s
> > >     syscall-count    :    8.917 ± 0.003M/s
> > >     uprobe-nop       :    3.056 ± 0.013M/s
> > >     uprobe-push      :    2.903 ± 0.002M/s
> > >     uprobe-ret       :    1.533 ± 0.001M/s
> > > --> uprobe-nop5      :    1.492 ± 0.000M/s
> > >     uretprobe-nop    :    1.783 ± 0.000M/s
> > >     uretprobe-push   :    1.672 ± 0.001M/s
> > >     uretprobe-ret    :    1.067 ± 0.002M/s
> > > --> uretprobe-nop5   :    1.052 ± 0.000M/s
> > >
> > > after the change:
> > >
> > >     usermode-count   :  818.386 ± 1.886M/s
> > >     syscall-count    :    8.923 ± 0.003M/s
> > >     uprobe-nop       :    3.086 ± 0.005M/s
> > >     uprobe-push      :    2.751 ± 0.001M/s
> > >     uprobe-ret       :    1.481 ± 0.000M/s
> > > --> uprobe-nop5      :    4.016 ± 0.002M/s
> > >     uretprobe-nop    :    1.712 ± 0.008M/s
> > >     uretprobe-push   :    1.616 ± 0.001M/s
> > >     uretprobe-ret    :    1.052 ± 0.000M/s
> > > --> uretprobe-nop5   :    2.015 ± 0.000M/s
> > >
> > > Signed-off-by: Jiri Olsa <jolsa@xxxxxxxxxx>
> > > ---
> > >  arch/x86/include/asm/uprobes.h |   6 ++
> > >  arch/x86/kernel/uprobes.c      | 191 ++++++++++++++++++++++++++++++++-
> > >  include/linux/uprobes.h        |   6 +-
> > >  kernel/events/uprobes.c        |  16 ++-
> > >  4 files changed, 209 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
> > > index 678fb546f0a7..7d4df920bb59 100644
> > > --- a/arch/x86/include/asm/uprobes.h
> > > +++ b/arch/x86/include/asm/uprobes.h
> > > @@ -20,6 +20,10 @@ typedef u8 uprobe_opcode_t;
> > >  #define UPROBE_SWBP_INSN	0xcc
> > >  #define UPROBE_SWBP_INSN_SIZE	1
> > >
> > > +enum {
> > > +	ARCH_UPROBE_FLAG_CAN_OPTIMIZE = 0,
> > > +};
> > > +
> > >  struct uprobe_xol_ops;
> > >
> > >  struct arch_uprobe {
> > > @@ -45,6 +49,8 @@ struct arch_uprobe {
> > >  			u8	ilen;
> > >  		} push;
> > >  	};
> > > +
> > > +	unsigned long flags;
> > >  };
> > >
> > >  struct arch_uprobe_task {
> > > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > > index e8aebbda83bc..73ddff823904 100644
> > > --- a/arch/x86/kernel/uprobes.c
> > > +++ b/arch/x86/kernel/uprobes.c
> > > @@ -18,6 +18,7 @@
> > >  #include <asm/processor.h>
> > >  #include <asm/insn.h>
> > >  #include <asm/mmu_context.h>
> > > +#include <asm/nops.h>
> > >
> > >  /* Post-execution fixups. */
> > >
> > > @@ -768,7 +769,7 @@ static struct uprobe_trampoline *create_uprobe_trampoline(unsigned long vaddr)
> > >  	return NULL;
> > >  }
> > >
> > > -static __maybe_unused struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr)
> > > +static struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr)
> > >  {
> > >  	struct uprobes_state *state = &current->mm->uprobes_state;
> > >  	struct uprobe_trampoline *tramp = NULL;
> > > @@ -794,7 +795,7 @@ static void destroy_uprobe_trampoline(struct uprobe_trampoline *tramp)
> > >  	kfree(tramp);
> > >  }
> > >
> > > -static __maybe_unused void uprobe_trampoline_put(struct uprobe_trampoline *tramp)
> > > +static void uprobe_trampoline_put(struct uprobe_trampoline *tramp)
> > >  {
> > >  	if (tramp == NULL)
> > >  		return;
> > > @@ -807,6 +808,7 @@ struct mm_uprobe {
> > >  	struct rb_node rb_node;
> > >  	unsigned long auprobe;
> > >  	unsigned long vaddr;
> > > +	bool optimized;
> > >  };
> > >
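
(Aside, for readers of the diff: IIUC the trampoline page handed out by
uprobe_trampoline_get() holds a small thunk roughly like the below. This is
a sketch assuming it mirrors the existing uretprobe syscall trampoline; the
exact register save set and the __NR_uprobe syscall name are my reading of
the earlier patches in this series, not something shown in this hunk:)

	/* user space trampoline, one page shared per rel32-reachable range */
	push   %rcx			/* syscall clobbers rcx (return RIP) */
	push   %r11			/* syscall clobbers r11 (rflags) */
	push   %rax			/* syscall number register */
	movq   $__NR_uprobe, %rax
	syscall				/* kernel runs the uprobe consumers */
	pop    %rax
	pop    %r11
	pop    %rcx
	ret				/* back to the instruction after the call site */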
> >
> > I'm trying to understand if this RB-tree based mm_uprobe is strictly
> > necessary. Is it? Sure, we keep the optimized flag, but that's more for
> > defensive checks, no? Is there any other reason we need this separate
> > lookup data structure?
>
> so the call instruction update is done in 2 locked steps:
> - first we write the breakpoint as part of normal uprobe registration
> - then when the uprobe is hit, we overwrite the breakpoint with the call
>   instruction
>
> in between we could race with another thread that could either unregister
> the uprobe or try to optimize the uprobe as well
>
> I think we either need to keep the state of the uprobe per process
> (mm_struct), or we would need to read the probed instruction each time we
> need to make a decision based on what state we are at (nop5, breakpoint,
> call)

This decision is only made in the "slow path", right? Only when
registering/unregistering. And those operations are done under lock. So
reading those 5 bytes every time we register/unregister seems completely
acceptable, rather than now *also* having a per-mm uprobe lookup tree.

>
> jirka
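
FWIW, a sketch of what I mean by reading those bytes back on the locked
register/unregister path. The helper name is made up, and the combination
of access_remote_vm(), x86_nops[5], UPROBE_SWBP_INSN and CALL_INSN_OPCODE
is just my illustration of the idea, not code from the patch:

	#include <linux/mm.h>		/* access_remote_vm(), FOLL_FORCE */
	#include <linux/string.h>	/* memcmp() */
	#include <asm/nops.h>		/* x86_nops[] */
	#include <asm/text-patching.h>	/* CALL_INSN_OPCODE (0xe8) */
	#include <asm/uprobes.h>	/* UPROBE_SWBP_INSN (0xcc) */

	enum insn_state { INSN_NOP5, INSN_SWBP, INSN_CALL, INSN_UNKNOWN };

	/* slow path only: caller holds the register/unregister locks */
	static enum insn_state read_insn_state(struct mm_struct *mm,
					       unsigned long vaddr)
	{
		u8 insn[5];

		/* read the probed bytes back from the target mm */
		if (access_remote_vm(mm, vaddr, insn, sizeof(insn),
				     FOLL_FORCE) != sizeof(insn))
			return INSN_UNKNOWN;
		if (insn[0] == UPROBE_SWBP_INSN)	/* breakpoint installed */
			return INSN_SWBP;
		if (insn[0] == CALL_INSN_OPCODE)	/* already optimized */
			return INSN_CALL;
		if (!memcmp(insn, x86_nops[5], 5))	/* untouched nop5 */
			return INSN_NOP5;
		return INSN_UNKNOWN;
	}

That keeps the nop5/breakpoint/call state decision entirely in the slow
path, without a second lookup structure to keep in sync.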