Re: [LSF/MM/BPF TOPIC] faster uprobes

On Tue, Mar 05, 2024 at 09:24:08AM +0100, Jiri Olsa wrote:
> On Mon, Mar 04, 2024 at 04:55:33PM -0800, Andrii Nakryiko wrote:
> > On Sun, Mar 3, 2024 at 2:20 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
> > >
> > > On Fri, Mar 01, 2024 at 09:26:57AM -0800, Andrii Nakryiko wrote:
> > > > On Fri, Mar 1, 2024 at 9:01 AM Alexei Starovoitov
> > > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > > >
> > > > > On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote:
> > > > > > > On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > One of uprobe pain points is having slow execution that involves
> > > > > > > > two traps in worst case scenario or single trap if the original
> > > > > > > > instruction can be emulated. For return uprobes there's one extra
> > > > > > > > trap on top of that.
> > > > > > > >
> > > > > > > > My current idea on how to make this faster is to follow the optimized
> > > > > > > > kprobes and replace the normal uprobe trap instruction with jump to
> > > > > > > > user space trampoline that:
> > > > > > > >
> > > > > > > >   - executes syscall to call uprobe consumers callbacks
> > > > > > >
> > > > > > > Did you get a chance to measure relative performance of syscall vs
> > > > > > > int3 interrupt handling? If not, do you think you'll be able to get
> > > > > > > some numbers by the time the conference starts? This should inform the
> > > > > > > decision whether it even makes sense to go through all the trouble.
> > > > > >
> > > > > > right, will do that
> > > > >
> > > > > I believe Yusheng measured syscall vs uprobe performance
> > > > > difference during LPC. iirc it was something like 3x.
> > > >
> > > > Do you have a link to slides? Was it actual uprobe vs just some fast
> > > > syscall (not doing BPF program execution) comparison? Or comparing the
> > > > performance of int3 handling vs equivalent syscall handling.
> > > >
> > > > I suspect it's the former, and so probably not that representative.
> > > > I'm curious about the performance of going
> > > > userspace->kernel->userspace through int3 vs syscall (all other things
> > > > being equal).
> > >
> > > I have a simple test [1] comparing:
> > >   - uprobe with 2 traps
> > >   - uprobe with 1 trap
> > >   - syscall executing uprobe
> > >
> > > the syscall takes uprobe address as argument, finds the uprobe and executes
> > > its consumers, which should be comparable to what the trampoline will do
> > >
> > > test does same amount of loops triggering each uprobe type and measures
> > > the time it took
> > >
> > >   # ./test_progs -t uprobe_syscall_bench -v
> > >   bpf_testmod.ko is already unloaded.
> > >   Loading bpf_testmod.ko...
> > >   Successfully loaded bpf_testmod.ko.
> > >   test_bench_1:PASS:uprobe_bench__open_and_load 0 nsec
> > >   test_bench_1:PASS:uprobe_bench__attach 0 nsec
> > >   test_bench_1:PASS:uprobe1_cnt 0 nsec
> > >   test_bench_1:PASS:syscalls_uprobe1_cnt 0 nsec
> > >   test_bench_1:PASS:uprobe2_cnt 0 nsec
> > >   test_bench_1: uprobes (1 trap) in  36.439s
> > >   test_bench_1: uprobes (2 trap) in  91.960s
> > >   test_bench_1: syscalls         in  17.872s
> > >   #395/1   uprobe_syscall_bench/bench_1:OK
> > >   #395     uprobe_syscall_bench:OK
> > >   Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED
> > >
> > > syscall uprobe execution seems to be ~2x faster than 1 trap uprobe
> > > and ~5x faster than 2 traps uprobe
> > >
> > 
> > Thanks for running benchmarks! I quickly looked at the selftest and
> > noticed this:
> > 
> > +/*
> > + * Assuming following prolog:
> > + *
> > + * 6984ac:       55                      push   %rbp
> > + * 6984ad:       48 89 e5                mov    %rsp,%rbp
> > + */
> > +noinline void uprobe2_bench_trigger(void)
> > +{
> > +        asm volatile ("");
> > +}
> > 
> > This actually will be optimized out to just ret in -O2 mode (make
> > RELEASE=1 for selftests):
> > 
> > 00000000005a0ce0 <uprobe2_bench_trigger>:
> >   5a0ce0: c3                            retq
> >   5a0ce1: 66 66 2e 0f 1f 84 00 00 00 00 00      nopw    %cs:(%rax,%rax)
> >   5a0cec: 0f 1f 40 00                   nopl    (%rax)
> > 
> > So be careful with that.
> 
> right, I did not mean for this to be checked in, just wanted to get the
> numbers quickly
> 
> > 
> > Also, I just updated our existing set of uprobe benchmarks (see [0]),
> > do you mind adding your syscall-based one as another one there and
> > running all of them and sharing the numbers with us? Very curious to
> > see both absolute and relative numbers from that benchmark. (and
> > please do build with RELEASE=1)
> > 
> > You should be able to just run benchs/run_bench_uprobes.sh (also don't
> > forget to add your syscall-based benchmark to the list of benchmarks
> > in that shell script).
> 
> yes, saw it and was going to run/compare it.. it's good idea to add
> the syscall one and get all numbers together, will do that
> 
> > 
> > Thank you!
> > 
> > 
> > BTW, while I think patching multiple instructions for syscall-based
> > uprobe is going to be extremely tricky, I think at least u*ret*probe's
> > int3 can be pretty easily optimized away with syscall, given that the
> > kernel controls code generation there. If anything, it will get the
> > uretprobe case a bit closer to the performance of uprobe. Give it some
> > thought.
> 
> hm, right.. the trampoline is there already, but at the moment is global
> and used by all uretprobes.. and int3 code moves userspace (changes rip)
> to the original return address.. maybe we can do that through syscall
> as well

it seems like a good idea. I tried the change below (using a syscall in the
return trampoline) and got some speedup:

current:
  base           :   15.817 ± 0.009M/s
  uprobe-nop     :    2.901 ± 0.000M/s
  uprobe-push    :    2.743 ± 0.002M/s
  uprobe-ret     :    1.089 ± 0.001M/s
  uretprobe-nop  :    1.448 ± 0.001M/s
  uretprobe-push :    1.407 ± 0.001M/s
  uretprobe-ret  :    0.792 ± 0.001M/s

with syscall:
  base           :   15.831 ± 0.026M/s
  uprobe-nop     :    2.904 ± 0.001M/s
  uprobe-push    :    2.764 ± 0.002M/s
  uprobe-ret     :    1.082 ± 0.001M/s
  uretprobe-nop  :    1.785 ± 0.000M/s
  uretprobe-push :    1.733 ± 0.001M/s
  uretprobe-ret  :    0.885 ± 0.004M/s

~23% speedup for the nop/push (emulated) cases, ~11% for the ret (single-stepped) case
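(for reference, the relative improvements can be recomputed from the
throughput numbers quoted above; a quick check:)

```python
# Recompute the uretprobe speedups from the M/s throughput numbers above.
current = {"uretprobe-nop": 1.448, "uretprobe-push": 1.407, "uretprobe-ret": 0.792}
syscall = {"uretprobe-nop": 1.785, "uretprobe-push": 1.733, "uretprobe-ret": 0.885}

for name in current:
    speedup = (syscall[name] / current[name] - 1) * 100
    print(f"{name}: +{speedup:.1f}%")
```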

jirka


---
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 7e8d46f4147f..fa5f8a058bc2 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -383,6 +383,7 @@
 459	common	lsm_get_self_attr	sys_lsm_get_self_attr
 460	common	lsm_set_self_attr	sys_lsm_set_self_attr
 461	common	lsm_list_modules	sys_lsm_list_modules
+462	64	uprobe			sys_uprobe
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 6c07f6daaa22..fceef2b4e243 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -12,11 +12,13 @@
 #include <linux/ptrace.h>
 #include <linux/uprobes.h>
 #include <linux/uaccess.h>
+#include <linux/syscalls.h>
 
 #include <linux/kdebug.h>
 #include <asm/processor.h>
 #include <asm/insn.h>
 #include <asm/mmu_context.h>
+#include <asm/syscalls.h>
 
 /* Post-execution fixups. */
 
@@ -308,6 +310,53 @@ static int uprobe_init_insn(struct arch_uprobe *auprobe, struct insn *insn, bool
 }
 
 #ifdef CONFIG_X86_64
+
+asm (
+       ".pushsection .rodata\n"
+       ".global uretprobe_syscall_entry\n"
+       "uretprobe_syscall_entry:\n"
+       "push %rax\n"
+       "mov $462, %rax\n"
+       "syscall\n"
+       ".global uretprobe_syscall_end\n"
+       "uretprobe_syscall_end:\n"
+       ".popsection\n"
+);
+
+extern __visible u8 uretprobe_syscall_entry[];
+extern __visible u8 uretprobe_syscall_end[];
+
+uprobe_opcode_t* arch_uprobe_trampoline(unsigned long *psize)
+{
+	*psize = uretprobe_syscall_end - uretprobe_syscall_entry;
+	return uretprobe_syscall_entry;
+}
+
+SYSCALL_DEFINE1(uprobe, unsigned long, cmd)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+	unsigned long ax, err;
+
+	/*
+	 * We get invoked from the trampoline, which pushed the rax
+	 * value on the stack; read it back and restore it.
+	 */
+	err = copy_from_user((void*) &ax, (void *) regs->sp, sizeof(ax));
+	WARN_ON_ONCE(err);
+
+	regs->ax = ax;
+	regs->orig_ax = ax;
+
+	/*
+	 * And pop the stack back, because we jump to the original
+	 * return address.
+	 */
+	regs->sp = regs->sp + sizeof(regs->sp);
+
+	uprobe_handle_trampoline(regs);
+	return ax;
+}
+
 /*
  * If arch_uprobe->insn doesn't use rip-relative addressing, return
  * immediately.  Otherwise, rewrite the instruction so that it accesses
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 77eb9b0e7685..4f7d5b41b718 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -987,6 +987,8 @@ asmlinkage long sys_spu_run(int fd, __u32 __user *unpc,
 asmlinkage long sys_spu_create(const char __user *name,
 		unsigned int flags, umode_t mode, int fd);
 
+asmlinkage long sys_uprobe(unsigned long cmd);
+
 
 /*
  * Deprecated system calls which are still defined in
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index f46e0ca0169c..9ef244c8ff19 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -138,6 +138,8 @@ extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
 extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
 extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 					 void *src, unsigned long len);
+extern void uprobe_handle_trampoline(struct pt_regs *regs);
+uprobe_opcode_t* arch_uprobe_trampoline(unsigned long *psize);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 75f00965ab15..2702799648e6 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -842,8 +842,11 @@ __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr)
 #define __NR_lsm_list_modules 461
 __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules)
 
+#define __NR_uprobe 462
+__SYSCALL(__NR_uprobe, sys_uprobe)
+
 #undef __NR_syscalls
-#define __NR_syscalls 462
+#define __NR_syscalls 463
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 929e98c62965..fefaf4804e1f 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1474,10 +1474,19 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 	return ret;
 }
 
+uprobe_opcode_t* __weak arch_uprobe_trampoline(unsigned long *psize)
+{
+	static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
+
+	*psize = UPROBE_SWBP_INSN_SIZE;
+	return &insn;
+}
+
 static struct xol_area *__create_xol_area(unsigned long vaddr)
 {
 	struct mm_struct *mm = current->mm;
-	uprobe_opcode_t insn = UPROBE_SWBP_INSN;
+	unsigned long insns_size;
+	uprobe_opcode_t *insns;
 	struct xol_area *area;
 
 	area = kmalloc(sizeof(*area), GFP_KERNEL);
@@ -1502,7 +1511,8 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
 	/* Reserve the 1st slot for get_trampoline_vaddr() */
 	set_bit(0, area->bitmap);
 	atomic_set(&area->slot_count, 1);
-	arch_uprobe_copy_ixol(area->pages[0], 0, &insn, UPROBE_SWBP_INSN_SIZE);
+	insns = arch_uprobe_trampoline(&insns_size);
+	arch_uprobe_copy_ixol(area->pages[0], 0, insns, insns_size);
 
 	if (!xol_add_vma(mm, area))
 		return area;
@@ -2123,7 +2133,7 @@ static struct return_instance *find_next_ret_chain(struct return_instance *ri)
 	return ri;
 }
 
-static void handle_trampoline(struct pt_regs *regs)
+void uprobe_handle_trampoline(struct pt_regs *regs)
 {
 	struct uprobe_task *utask;
 	struct return_instance *ri, *next;
@@ -2188,7 +2198,7 @@ static void handle_swbp(struct pt_regs *regs)
 
 	bp_vaddr = uprobe_get_swbp_addr(regs);
 	if (bp_vaddr == get_trampoline_vaddr())
-		return handle_trampoline(regs);
+		return uprobe_handle_trampoline(regs);
 
 	uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
 	if (!uprobe) {
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index faad00cce269..ddc954f28317 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -391,3 +391,5 @@ COND_SYSCALL(setuid16);
 
 /* restartable sequence */
 COND_SYSCALL(rseq);
+
+COND_SYSCALL(uprobe);

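to make the control flow of the patch easier to follow, here's a rough
userspace-level model of it (Python, purely illustrative; the register
and stack names just mirror the patch, none of this is kernel API, and
the concrete addresses are made up):

```python
# Illustrative model of the uretprobe syscall trampoline above.
# The trampoline saves rax (about to be clobbered by the syscall
# number), sys_uprobe restores it from the user stack, pops that
# slot, and uprobe_handle_trampoline() redirects ip to the
# original return address.

ORIG_RET_ADDR = 0x401234  # made-up return address for the demo

class Regs:
    def __init__(self, ax, sp, ip):
        self.ax, self.sp, self.ip = ax, sp, ip

def trampoline_entry(regs, stack):
    """Userspace side: push %rax; mov $462, %rax; syscall."""
    regs.sp -= 8
    stack[regs.sp] = regs.ax     # push %rax
    regs.ax = 462                # syscall number clobbers rax
    return sys_uprobe(regs, stack)

def sys_uprobe(regs, stack):
    """Kernel side, modeled on SYSCALL_DEFINE1(uprobe, ...)."""
    ax = stack[regs.sp]          # read the saved rax back
    regs.ax = ax                 # regs->ax = ax; regs->orig_ax = ax
    regs.sp += 8                 # pop the slot pushed by the trampoline
    regs.ip = ORIG_RET_ADDR      # what uprobe_handle_trampoline() does
    return ax

regs = Regs(ax=0xdead, sp=0x7fff0000, ip=0)
stack = {}
ret = trampoline_entry(regs, stack)
print(hex(regs.ax), hex(regs.sp), hex(regs.ip))
```

after the "syscall", rax and sp are back to their pre-trampoline values
and ip points at the original return address, which is why the
trampoline never executes past the syscall instruction.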


