Re: [PATCH RFCv2 12/18] uprobes/x86: Add support to optimize uprobes

On Mon, Feb 24, 2025 at 6:04 AM Jiri Olsa <jolsa@xxxxxxxxxx> wrote:
>
> Putting together all the previously added pieces to support optimized
> uprobes on top of 5-byte nop instruction.
>
> The current uprobe execution goes through following:
>   - installs breakpoint instruction over original instruction
>   - exception handler hit and calls related uprobe consumers
>   - and either simulates original instruction or does out of line single step
>     execution of it
>   - returns to user space
>
> The optimized uprobe path
>
>   - checks the original instruction is 5-byte nop (plus other checks)
>   - adds (or uses existing) user space trampoline and overwrites original
>     instruction (5-byte nop) with call to user space trampoline
>   - the user space trampoline executes uprobe syscall that calls related uprobe
>     consumers
>   - trampoline returns back to next instruction
>
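
For readers following along, the nop5-to-call rewrite described in the list above can be sketched in plain user-space C. This is an illustration only, not code from the patch: the names INSN_SIZE, is_nop5() and encode_call() are made up here, and the real kernel code also has to deal with page permissions, atomic updates and concurrent threads, none of which is shown. The only facts relied on are the standard x86-64 encodings: the 5-byte nop is 0f 1f 44 00 00, and a near call is e8 followed by a 32-bit little-endian displacement relative to the next instruction.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define INSN_SIZE 5 /* nop5 and call rel32 are both 5 bytes long */

/* 5-byte nop: nopl 0x0(%rax,%rax,1) */
static const uint8_t nop5[INSN_SIZE] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

/* Hypothetical helper: does the probed location hold a 5-byte nop? */
static bool is_nop5(const uint8_t *insn)
{
	return memcmp(insn, nop5, INSN_SIZE) == 0;
}

/*
 * Hypothetical helper: encode "call tramp" for an instruction at vaddr.
 * The displacement is relative to the next instruction (vaddr + 5) and
 * must fit in a signed 32-bit value, so the target has to be within
 * roughly +/-2GB of the call site.
 */
static bool encode_call(uint8_t *insn, uint64_t vaddr, uint64_t tramp)
{
	/* Assumes two's complement wrap-around, as on every x86-64 compiler. */
	int64_t delta = (int64_t)(tramp - (vaddr + INSN_SIZE));
	uint32_t rel;
	int i;

	if (delta < INT32_MIN || delta > INT32_MAX)
		return false; /* trampoline is out of rel32 reach */

	rel = (uint32_t)(int32_t)delta;
	insn[0] = 0xe8; /* call rel32 opcode */
	for (i = 0; i < 4; i++)
		insn[1 + i] = (uint8_t)(rel >> (8 * i)); /* little-endian */
	return true;
}
```

A caller would first check is_nop5() on the original bytes and only then overwrite them with the output of encode_call(). The rel32 range check is presumably what is behind the "every 4GB" remark further down: a call site can only reach a trampoline within about 2GB in either direction.
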
> This approach won't speed up all uprobes as it's limited to using nop5 as
> original instruction, but we could use nop5 as USDT probe instruction (which
> uses single byte nop ATM) and speed up the USDT probes.
>
> This patch overloads related arch functions in uprobe_write_opcode and
> set_orig_insn so they can install call instruction if needed.
>
> The arch_uprobe_optimize triggers the uprobe optimization and is called after
> first uprobe hit. I originally had it called on uprobe installation but then
> it clashed with elf loader, because the user space trampoline was added in a
> place where loader might need to put elf segments, so I decided to do it after
> first uprobe hit when loading is done.
>
> We do not unmap and release uprobe trampoline when it's no longer needed,
> because there's no easy way to make sure none of the threads is still
> inside the trampoline. But we do not waste memory, because there's just
> single page for all the uprobe trampoline mappings.
>
> We do waste a frame on the page mapping for every 4GB by keeping the uprobe
> trampoline page mapped, but that seems ok.
>
> Attaching the speed up from benchs/run_bench_uprobes.sh script:
>
> current:
>         usermode-count :  818.836 ± 2.842M/s
>         syscall-count  :    8.917 ± 0.003M/s
>         uprobe-nop     :    3.056 ± 0.013M/s
>         uprobe-push    :    2.903 ± 0.002M/s
>         uprobe-ret     :    1.533 ± 0.001M/s
> -->     uprobe-nop5    :    1.492 ± 0.000M/s
>         uretprobe-nop  :    1.783 ± 0.000M/s
>         uretprobe-push :    1.672 ± 0.001M/s
>         uretprobe-ret  :    1.067 ± 0.002M/s
> -->     uretprobe-nop5 :    1.052 ± 0.000M/s
>
> after the change:
>
>         usermode-count :  818.386 ± 1.886M/s
>         syscall-count  :    8.923 ± 0.003M/s
>         uprobe-nop     :    3.086 ± 0.005M/s
>         uprobe-push    :    2.751 ± 0.001M/s
>         uprobe-ret     :    1.481 ± 0.000M/s
> -->     uprobe-nop5    :    4.016 ± 0.002M/s
>         uretprobe-nop  :    1.712 ± 0.008M/s
>         uretprobe-push :    1.616 ± 0.001M/s
>         uretprobe-ret  :    1.052 ± 0.000M/s
> -->     uretprobe-nop5 :    2.015 ± 0.000M/s
>
> Signed-off-by: Jiri Olsa <jolsa@xxxxxxxxxx>
> ---
>  arch/x86/include/asm/uprobes.h |   6 ++
>  arch/x86/kernel/uprobes.c      | 191 ++++++++++++++++++++++++++++++++-
>  include/linux/uprobes.h        |   6 +-
>  kernel/events/uprobes.c        |  16 ++-
>  4 files changed, 209 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
> index 678fb546f0a7..7d4df920bb59 100644
> --- a/arch/x86/include/asm/uprobes.h
> +++ b/arch/x86/include/asm/uprobes.h
> @@ -20,6 +20,10 @@ typedef u8 uprobe_opcode_t;
>  #define UPROBE_SWBP_INSN               0xcc
>  #define UPROBE_SWBP_INSN_SIZE             1
>
> +enum {
> +       ARCH_UPROBE_FLAG_CAN_OPTIMIZE   = 0,
> +};
> +
>  struct uprobe_xol_ops;
>
>  struct arch_uprobe {
> @@ -45,6 +49,8 @@ struct arch_uprobe {
>                         u8      ilen;
>                 }                       push;
>         };
> +
> +       unsigned long flags;
>  };
>
>  struct arch_uprobe_task {
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index e8aebbda83bc..73ddff823904 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -18,6 +18,7 @@
>  #include <asm/processor.h>
>  #include <asm/insn.h>
>  #include <asm/mmu_context.h>
> +#include <asm/nops.h>
>
>  /* Post-execution fixups. */
>
> @@ -768,7 +769,7 @@ static struct uprobe_trampoline *create_uprobe_trampoline(unsigned long vaddr)
>         return NULL;
>  }
>
> -static __maybe_unused struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr)
> +static struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr)
>  {
>         struct uprobes_state *state = &current->mm->uprobes_state;
>         struct uprobe_trampoline *tramp = NULL;
> @@ -794,7 +795,7 @@ static void destroy_uprobe_trampoline(struct uprobe_trampoline *tramp)
>         kfree(tramp);
>  }
>
> -static __maybe_unused void uprobe_trampoline_put(struct uprobe_trampoline *tramp)
> +static void uprobe_trampoline_put(struct uprobe_trampoline *tramp)
>  {
>         if (tramp == NULL)
>                 return;
> @@ -807,6 +808,7 @@ struct mm_uprobe {
>         struct rb_node rb_node;
>         unsigned long auprobe;
>         unsigned long vaddr;
> +       bool optimized;
>  };
>

I'm trying to understand whether this RB-tree based mm_uprobe is strictly
necessary. Is it? Sure, we keep the optimized flag, but that's more for
defensive checks, no? Is there any other reason we need this separate
lookup data structure?

>  #define __node_2_mm_uprobe(node) rb_entry((node), struct mm_uprobe, rb_node)
> @@ -874,6 +876,7 @@ static struct mm_uprobe *insert_mm_uprobe(struct mm_struct *mm, struct arch_upro
>         if (mmu) {
>                 mmu->auprobe = (unsigned long) auprobe;
>                 mmu->vaddr = vaddr;
> +               mmu->optimized = false;
>                 RB_CLEAR_NODE(&mmu->rb_node);
>                 rb_add(&mmu->rb_node, &mm->uprobes_state.root_uprobes, __mm_uprobe_less);
>         }

[...]
