On Fri, Feb 28, 2025 at 2:55 PM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
>
> On Fri, Feb 28, 2025 at 10:55:24AM -0800, Andrii Nakryiko wrote:
> > On Mon, Feb 24, 2025 at 6:04 AM Jiri Olsa <jolsa@xxxxxxxxxx> wrote:
> > >
> > > Putting together all the previously added pieces to support optimized
> > > uprobes on top of a 5-byte nop instruction.
> > >
> > > The current uprobe execution goes through the following steps:
> > >   - installs the breakpoint instruction over the original instruction
> > >   - the exception handler is hit and calls the related uprobe consumers
> > >   - either simulates the original instruction or does out-of-line
> > >     single-step execution of it
> > >   - returns to user space
> > >
> > > The optimized uprobe path:
> > >   - checks that the original instruction is a 5-byte nop (plus other checks)
> > >   - adds (or reuses an existing) user space trampoline and overwrites the
> > >     original instruction (5-byte nop) with a call to the user space
> > >     trampoline
> > >   - the user space trampoline executes the uprobe syscall that calls the
> > >     related uprobe consumers
> > >   - the trampoline returns back to the next instruction
> > >
> > > This approach won't speed up all uprobes, as it's limited to using nop5 as
> > > the original instruction, but we could use nop5 as the USDT probe
> > > instruction (which uses a single-byte nop ATM) and speed up the USDT probes.
> > >
> > > This patch overloads the related arch functions in uprobe_write_opcode and
> > > set_orig_insn so they can install the call instruction if needed.
> > >
> > > The arch_uprobe_optimize function triggers the uprobe optimization and is
> > > called after the first uprobe hit. I originally had it called on uprobe
> > > installation, but it clashed with the elf loader, because the user space
> > > trampoline was added in a place where the loader might need to put elf
> > > segments, so I decided to do it after the first uprobe hit, when loading
> > > is done.
> > >
> > > We do not unmap and release the uprobe trampoline when it's no longer
> > > needed, because there's no easy way to make sure none of the threads is
> > > still inside the trampoline. But we do not waste memory, because there's
> > > just a single page for all the uprobe trampoline mappings.
> > >
> > > We do waste a page frame for every 4GB by keeping the uprobe trampoline
> > > page mapped, but that seems ok.
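
(For reference, my reading of the probe-site rewrite described above, shown
with the canonical 5-byte nop; the actual nop5 encoding and the rel32 offset
to the trampoline page are up to the patch, so these bytes are illustrative:)

    before:  0f 1f 44 00 00      nopl 0x0(%rax,%rax,1)    /* 5-byte nop */
    after:   e8 xx xx xx xx      call <uprobe trampoline>  /* rel32 call */

The rel32 displacement reaches at most +/-2GB, which is presumably why one
trampoline page per 4GB range is needed.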
> > >
> > > Attaching the speedup from the benchs/run_bench_uprobes.sh script:
> > >
> > > current:
> > >     usermode-count   :  818.836 ± 2.842M/s
> > >     syscall-count    :    8.917 ± 0.003M/s
> > >     uprobe-nop       :    3.056 ± 0.013M/s
> > >     uprobe-push      :    2.903 ± 0.002M/s
> > >     uprobe-ret       :    1.533 ± 0.001M/s
> > > --> uprobe-nop5      :    1.492 ± 0.000M/s
> > >     uretprobe-nop    :    1.783 ± 0.000M/s
> > >     uretprobe-push   :    1.672 ± 0.001M/s
> > >     uretprobe-ret    :    1.067 ± 0.002M/s
> > > --> uretprobe-nop5   :    1.052 ± 0.000M/s
> > >
> > > after the change:
> > >
> > >     usermode-count   :  818.386 ± 1.886M/s
> > >     syscall-count    :    8.923 ± 0.003M/s
> > >     uprobe-nop       :    3.086 ± 0.005M/s
> > >     uprobe-push      :    2.751 ± 0.001M/s
> > >     uprobe-ret       :    1.481 ± 0.000M/s
> > > --> uprobe-nop5      :    4.016 ± 0.002M/s
> > >     uretprobe-nop    :    1.712 ± 0.008M/s
> > >     uretprobe-push   :    1.616 ± 0.001M/s
> > >     uretprobe-ret    :    1.052 ± 0.000M/s
> > > --> uretprobe-nop5   :    2.015 ± 0.000M/s
> > >
> > > Signed-off-by: Jiri Olsa <jolsa@xxxxxxxxxx>
> > > ---
> > >  arch/x86/include/asm/uprobes.h |   6 ++
> > >  arch/x86/kernel/uprobes.c      | 191 ++++++++++++++++++++++++++++++++-
> > >  include/linux/uprobes.h        |   6 +-
> > >  kernel/events/uprobes.c        |  16 ++-
> > >  4 files changed, 209 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
> > > index 678fb546f0a7..7d4df920bb59 100644
> > > --- a/arch/x86/include/asm/uprobes.h
> > > +++ b/arch/x86/include/asm/uprobes.h
> > > @@ -20,6 +20,10 @@ typedef u8 uprobe_opcode_t;
> > >  #define UPROBE_SWBP_INSN	0xcc
> > >  #define UPROBE_SWBP_INSN_SIZE	1
> > >
> > > +enum {
> > > +	ARCH_UPROBE_FLAG_CAN_OPTIMIZE = 0,
> > > +};
> > > +
> > >  struct uprobe_xol_ops;
> > >
> > >  struct arch_uprobe {
> > > @@ -45,6 +49,8 @@ struct arch_uprobe {
> > >  			u8	ilen;
> > >  		} push;
> > >  	};
> > > +
> > > +	unsigned long flags;
> > >  };
> > >
> > >  struct arch_uprobe_task {
> > > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > > index e8aebbda83bc..73ddff823904 100644
> > > --- a/arch/x86/kernel/uprobes.c
> > > +++ b/arch/x86/kernel/uprobes.c
> > > @@ -18,6 +18,7 @@
> > >  #include <asm/processor.h>
> > >  #include <asm/insn.h>
> > >  #include <asm/mmu_context.h>
> > > +#include <asm/nops.h>
> > >
> > >  /* Post-execution fixups. */
> > >
> > > @@ -768,7 +769,7 @@ static struct uprobe_trampoline *create_uprobe_trampoline(unsigned long vaddr)
> > >  	return NULL;
> > >  }
> > >
> > > -static __maybe_unused struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr)
> > > +static struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr)
> > >  {
> > >  	struct uprobes_state *state = &current->mm->uprobes_state;
> > >  	struct uprobe_trampoline *tramp = NULL;
> > > @@ -794,7 +795,7 @@ static void destroy_uprobe_trampoline(struct uprobe_trampoline *tramp)
> > >  	kfree(tramp);
> > >  }
> > >
> > > -static __maybe_unused void uprobe_trampoline_put(struct uprobe_trampoline *tramp)
> > > +static void uprobe_trampoline_put(struct uprobe_trampoline *tramp)
> > >  {
> > >  	if (tramp == NULL)
> > >  		return;
> > > @@ -807,6 +808,7 @@ struct mm_uprobe {
> > >  	struct rb_node rb_node;
> > >  	unsigned long auprobe;
> > >  	unsigned long vaddr;
> > > +	bool optimized;
> > >  };
> > >
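
(Aside, for readers of the diff: IIUC the trampoline page handed out by
uprobe_trampoline_get() holds a small thunk roughly like the below. This is
a sketch assuming it mirrors the existing uretprobe syscall trampoline; the
exact register save set and the __NR_uprobe syscall name are my reading of
the earlier patches in this series, not something shown in this hunk:)

	/* user space trampoline, one page shared per rel32-reachable range */
	push   %rcx			/* syscall clobbers rcx (return RIP) */
	push   %r11			/* syscall clobbers r11 (rflags) */
	push   %rax			/* syscall number register */
	movq   $__NR_uprobe, %rax
	syscall				/* kernel runs the uprobe consumers */
	pop    %rax
	pop    %r11
	pop    %rcx
	ret				/* back to the instruction after the call site */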
> >
> > I'm trying to understand if this RB-tree based mm_uprobe is strictly
> > necessary. Is it? Sure, we keep the optimized flag, but that's more for
> > defensive checks, no? Is there any other reason we need this separate
> > lookup data structure?
>
> so the call instruction update is done in 2 locked steps:
> - first we write the breakpoint as part of normal uprobe registration
> - then when the uprobe is hit, we overwrite the breakpoint with the call
>   instruction
>
> in between we could race with another thread that could either unregister
> the uprobe or try to optimize the uprobe as well
>
> I think we either need to keep the state of the uprobe per process
> (mm_struct), or we would need to read the probed instruction each time we
> need to make a decision based on what state we are at (nop5, breakpoint,
> call)

This decision is only made in the "slow path", right? Only when
registering/unregistering. And those operations are done under lock. So
reading those 5 bytes every time we register/unregister seems completely
acceptable, rather than now *also* having a per-mm uprobe lookup tree.

>
> jirka
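
FWIW, a sketch of what I mean by reading those bytes back on the locked
register/unregister path. The helper name is made up, and the combination
of access_remote_vm(), x86_nops[5], UPROBE_SWBP_INSN and CALL_INSN_OPCODE
is just my illustration of the idea, not code from the patch:

	#include <linux/mm.h>		/* access_remote_vm(), FOLL_FORCE */
	#include <linux/string.h>	/* memcmp() */
	#include <asm/nops.h>		/* x86_nops[] */
	#include <asm/text-patching.h>	/* CALL_INSN_OPCODE (0xe8) */
	#include <asm/uprobes.h>	/* UPROBE_SWBP_INSN (0xcc) */

	enum insn_state { INSN_NOP5, INSN_SWBP, INSN_CALL, INSN_UNKNOWN };

	/* slow path only: caller holds the register/unregister locks */
	static enum insn_state read_insn_state(struct mm_struct *mm,
					       unsigned long vaddr)
	{
		u8 insn[5];

		/* read the probed bytes back from the target mm */
		if (access_remote_vm(mm, vaddr, insn, sizeof(insn),
				     FOLL_FORCE) != sizeof(insn))
			return INSN_UNKNOWN;
		if (insn[0] == UPROBE_SWBP_INSN)	/* breakpoint installed */
			return INSN_SWBP;
		if (insn[0] == CALL_INSN_OPCODE)	/* already optimized */
			return INSN_CALL;
		if (!memcmp(insn, x86_nops[5], 5))	/* untouched nop5 */
			return INSN_NOP5;
		return INSN_UNKNOWN;
	}

That keeps the nop5/breakpoint/call state decision entirely in the slow
path, without a second lookup structure to keep in sync.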