On Tue, Mar 05, 2024 at 09:24:08AM +0100, Jiri Olsa wrote:
> On Mon, Mar 04, 2024 at 04:55:33PM -0800, Andrii Nakryiko wrote:
> > On Sun, Mar 3, 2024 at 2:20 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
> > >
> > > On Fri, Mar 01, 2024 at 09:26:57AM -0800, Andrii Nakryiko wrote:
> > > > On Fri, Mar 1, 2024 at 9:01 AM Alexei Starovoitov
> > > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > > >
> > > > > On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote:
> > > > > > > On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > One of uprobe pain points is having slow execution that involves
> > > > > > > > two traps in worst case scenario or single trap if the original
> > > > > > > > instruction can be emulated. For return uprobes there's one extra
> > > > > > > > trap on top of that.
> > > > > > > >
> > > > > > > > My current idea on how to make this faster is to follow the optimized
> > > > > > > > kprobes and replace the normal uprobe trap instruction with jump to
> > > > > > > > user space trampoline that:
> > > > > > > >
> > > > > > > >   - executes syscall to call uprobe consumers callbacks
> > > > > > >
> > > > > > > Did you get a chance to measure relative performance of syscall vs
> > > > > > > int3 interrupt handling? If not, do you think you'll be able to get
> > > > > > > some numbers by the time the conference starts? This should inform the
> > > > > > > decision whether it even makes sense to go through all the trouble.
> > > > > >
> > > > > > right, will do that
> > > > >
> > > > > I believe Yusheng measured syscall vs uprobe performance
> > > > > difference during LPC. iirc it was something like 3x.
> > > >
> > > > Do you have a link to slides? Was it actual uprobe vs just some fast
> > > > syscall (not doing BPF program execution) comparison? Or comparing the
> > > > performance of int3 handling vs equivalent syscall handling.
> > > >
> > > > I suspect it's the former, and so probably not that representative.
> > > > I'm curious about the performance of going
> > > > userspace->kernel->userspace through int3 vs syscall (all other things
> > > > being equal).
> > >
> > > I have a simple test [1] comparing:
> > >   - uprobe with 2 traps
> > >   - uprobe with 1 trap
> > >   - syscall executing uprobe
> > >
> > > the syscall takes uprobe address as argument, finds the uprobe and executes
> > > its consumers, which should be comparable to what the trampoline will do
> > >
> > > test does same amount of loops triggering each uprobe type and measures
> > > the time it took
> > >
> > >   # ./test_progs -t uprobe_syscall_bench -v
> > >   bpf_testmod.ko is already unloaded.
> > >   Loading bpf_testmod.ko...
> > >   Successfully loaded bpf_testmod.ko.
> > >   test_bench_1:PASS:uprobe_bench__open_and_load 0 nsec
> > >   test_bench_1:PASS:uprobe_bench__attach 0 nsec
> > >   test_bench_1:PASS:uprobe1_cnt 0 nsec
> > >   test_bench_1:PASS:syscalls_uprobe1_cnt 0 nsec
> > >   test_bench_1:PASS:uprobe2_cnt 0 nsec
> > >   test_bench_1: uprobes (1 trap) in 36.439s
> > >   test_bench_1: uprobes (2 trap) in 91.960s
> > >   test_bench_1: syscalls in 17.872s
> > >   #395/1 uprobe_syscall_bench/bench_1:OK
> > >   #395 uprobe_syscall_bench:OK
> > >   Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED
> > >
> > > syscall uprobe execution seems to be ~2x faster than 1 trap uprobe
> > > and ~5x faster than 2 traps uprobe
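fwiw, on the raw user->kernel->user round trip question above: a quick
userspace approximation is below (my sketch, not the selftest). note the
int3 path here includes SIGTRAP delivery to a handler, which uprobes do
not pay (they handle the trap entirely in the kernel), so it only gives
an upper bound for the int3 side:

  #include <signal.h>
  #include <stdio.h>
  #include <time.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  static void sigtrap(int sig) { (void) sig; }

  static unsigned long long now_ns(void)
  {
  	struct timespec ts;

  	clock_gettime(CLOCK_MONOTONIC, &ts);
  	return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
  }

  int main(void)
  {
  	const int loops = 1000000;
  	unsigned long long t;

  	signal(SIGTRAP, sigtrap);

  	/* cheap raw syscall as the baseline kernel entry */
  	t = now_ns();
  	for (int i = 0; i < loops; i++)
  		syscall(SYS_getpid);
  	printf("syscall    : %llu ns/op\n", (now_ns() - t) / loops);

  	/* int3 -> #BP -> SIGTRAP handler -> sigreturn */
  	t = now_ns();
  	for (int i = 0; i < loops; i++)
  		asm volatile ("int3");
  	printf("int3+signal: %llu ns/op\n", (now_ns() - t) / loops);

  	return 0;
  }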
> > Thanks for running benchmarks! I quickly looked at the selftest and
> > noticed this:
> >
> >   +/*
> >   + * Assuming following prolog:
> >   + *
> >   + *   6984ac:	55			push   %rbp
> >   + *   6984ad:	48 89 e5		mov    %rsp,%rbp
> >   + */
> >   +noinline void uprobe2_bench_trigger(void)
> >   +{
> >   +	asm volatile ("");
> >   +}
> >
> > This actually will be optimized out to just ret in -O2 mode (make
> > RELEASE=1 for selftests):
> >
> >   00000000005a0ce0 <uprobe2_bench_trigger>:
> >     5a0ce0: c3				retq
> >     5a0ce1: 66 66 2e 0f 1f 84 00 00 00 00 00	nopw   %cs:(%rax,%rax)
> >     5a0cec: 0f 1f 40 00			nopl   (%rax)
> >
> > So be careful with that.
>
> right, I did not mean for this to be checked in, just wanted to get the
> numbers quickly
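if we end up keeping the 2-trap trigger, something like below (untested
sketch) should survive RELEASE=1 -- alloca forces the compiler to set up
a frame pointer, so the function can't collapse into a bare ret:

  #include <alloca.h>

  __attribute__((noinline)) void uprobe2_bench_trigger(void)
  {
  	/* variable-size stack allocation keeps push %rbp/mov %rsp,%rbp */
  	void *frame = alloca(1);

  	/* hide the allocation from the optimizer */
  	asm volatile ("" : : "r" (frame) : "memory");
  }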
> > Also, I just updated our existing set of uprobe benchmarks (see [0]),
> > do you mind adding your syscall-based one as another one there and
> > running all of them and sharing the numbers with us? Very curious to
> > see both absolute and relative numbers from that benchmark. (and
> > please do build with RELEASE=1)
> >
> > You should be able to just run benchs/run_bench_uprobes.sh (also don't
> > forget to add your syscall-based benchmark to the list of benchmarks
> > in that shell script).
>
> yes, saw it and was going to run/compare it.. it's good idea to add
> the syscall one and get all numbers together, will do that
>
> > Thank you!
> >
> > BTW, while I think patching multiple instructions for syscall-based
> > uprobe is going to be extremely tricky, I think at least u*ret*probe's
> > int3 can be pretty easily optimized away with syscall, given that the
> > kernel controls code generation there. If anything, it will get the
> > uretprobe case a bit closer to the performance of uprobe. Give it some
> > thought.
>
> hm, right.. the trampoline is there already, but at the moment is global
> and used by all uretprobes.. and int3 code moves userspace (changes rip)
> to the original return address.. maybe we can do that through syscall
> as well

it seems like a good idea, I tried the change below (use a syscall on the
return trampoline) and got some speedup:

current:

  base           :   15.817 ± 0.009M/s
  uprobe-nop     :    2.901 ± 0.000M/s
  uprobe-push    :    2.743 ± 0.002M/s
  uprobe-ret     :    1.089 ± 0.001M/s
  uretprobe-nop  :    1.448 ± 0.001M/s
  uretprobe-push :    1.407 ± 0.001M/s
  uretprobe-ret  :    0.792 ± 0.001M/s

with syscall:

  base           :   15.831 ± 0.026M/s
  uprobe-nop     :    2.904 ± 0.001M/s
  uprobe-push    :    2.764 ± 0.002M/s
  uprobe-ret     :    1.082 ± 0.001M/s
  uretprobe-nop  :    1.785 ± 0.000M/s
  uretprobe-push :    1.733 ± 0.001M/s
  uretprobe-ret  :    0.885 ± 0.004M/s

that's ~23% speedup for the nop/push (emulated) cases and ~11% for the
ret (single-stepped) case
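to make the rax juggling in sys_uprobe below easier to follow, this is
the trampoline stub from the patch, annotated (comments mine):

  push %rax        /* the probed function's return value is still live
                      in %rax and syscall clobbers %rax (nr in, retval
                      out), so save it on the user stack */
  mov $462, %rax   /* __NR_uprobe */
  syscall          /* sys_uprobe reads the saved value back, pops it and
                      returns it, so %rax is intact when we continue at
                      the original return address; rcx/r11 get clobbered
                      too, but those are scratch regs after a ret anyway */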
jirka


---
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 7e8d46f4147f..fa5f8a058bc2 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -383,6 +383,7 @@
 459	common	lsm_get_self_attr	sys_lsm_get_self_attr
 460	common	lsm_set_self_attr	sys_lsm_set_self_attr
 461	common	lsm_list_modules	sys_lsm_list_modules
+462	64	uprobe			sys_uprobe
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 6c07f6daaa22..fceef2b4e243 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -12,11 +12,13 @@
 #include <linux/ptrace.h>
 #include <linux/uprobes.h>
 #include <linux/uaccess.h>
+#include <linux/syscalls.h>
 
 #include <linux/kdebug.h>
 #include <asm/processor.h>
 #include <asm/insn.h>
 #include <asm/mmu_context.h>
+#include <asm/syscalls.h>
 
 /* Post-execution fixups. */
@@ -308,6 +310,53 @@ static int uprobe_init_insn(struct arch_uprobe *auprobe, struct insn *insn, bool
 }
 
 #ifdef CONFIG_X86_64
+
+asm (
+	".pushsection .rodata\n"
+	".global uretprobe_syscall_entry\n"
+	"uretprobe_syscall_entry:\n"
+	"push %rax\n"
+	"mov $462, %rax\n"
+	"syscall\n"
+	".global uretprobe_syscall_end\n"
+	"uretprobe_syscall_end:\n"
+	".popsection\n"
+);
+
+extern __visible u8 uretprobe_syscall_entry[];
+extern __visible u8 uretprobe_syscall_end[];
+
+uprobe_opcode_t* arch_uprobe_trampoline(unsigned long *psize)
+{
+	*psize = uretprobe_syscall_end - uretprobe_syscall_entry;
+	return uretprobe_syscall_entry;
+}
+
+SYSCALL_DEFINE1(uprobe, unsigned long, cmd)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+	unsigned long ax, err;
+
+	/*
+	 * We get invoked from the trampoline that pushed the rax
+	 * value on the stack; read and restore that value.
+	 */
+	err = copy_from_user((void*) &ax, (void *) regs->sp, sizeof(ax));
+	WARN_ON_ONCE(err);
+
+	regs->ax = ax;
+	regs->orig_ax = ax;
+
+	/*
+	 * And pop the stack back, because we jump to the original
+	 * return address.
+	 */
+	regs->sp = regs->sp + sizeof(regs->sp);
+
+	uprobe_handle_trampoline(regs);
+	return ax;
+}
+
 /*
  * If arch_uprobe->insn doesn't use rip-relative addressing, return
  * immediately. Otherwise, rewrite the instruction so that it accesses
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 77eb9b0e7685..4f7d5b41b718 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -987,6 +987,8 @@ asmlinkage long sys_spu_run(int fd, __u32 __user *unpc,
 asmlinkage long sys_spu_create(const char __user *name,
 				unsigned int flags, umode_t mode, int fd);
 
+asmlinkage long sys_uprobe(unsigned long cmd);
+
 /*
  * Deprecated system calls which are still defined in
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index f46e0ca0169c..9ef244c8ff19 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -138,6 +138,8 @@ extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
 extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
 extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 					void *src, unsigned long len);
+extern void uprobe_handle_trampoline(struct pt_regs *regs);
+uprobe_opcode_t* arch_uprobe_trampoline(unsigned long *psize);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 75f00965ab15..2702799648e6 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -842,8 +842,11 @@ __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr)
 #define __NR_lsm_list_modules 461
 __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules)
 
+#define __NR_uprobe 462
+__SYSCALL(__NR_uprobe, sys_uprobe)
+
 #undef __NR_syscalls
-#define __NR_syscalls 462
+#define __NR_syscalls 463
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 929e98c62965..fefaf4804e1f 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1474,10 +1474,19 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 	return ret;
 }
 
+uprobe_opcode_t* __weak arch_uprobe_trampoline(unsigned long *psize)
+{
+	static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
+
+	*psize = UPROBE_SWBP_INSN_SIZE;
+	return &insn;
+}
+
 static struct xol_area *__create_xol_area(unsigned long vaddr)
 {
 	struct mm_struct *mm = current->mm;
-	uprobe_opcode_t insn = UPROBE_SWBP_INSN;
+	unsigned long insns_size;
+	uprobe_opcode_t *insns;
 	struct xol_area *area;
 
 	area = kmalloc(sizeof(*area), GFP_KERNEL);
@@ -1502,7 +1511,8 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
 	/* Reserve the 1st slot for get_trampoline_vaddr() */
 	set_bit(0, area->bitmap);
 	atomic_set(&area->slot_count, 1);
-	arch_uprobe_copy_ixol(area->pages[0], 0, &insn, UPROBE_SWBP_INSN_SIZE);
+	insns = arch_uprobe_trampoline(&insns_size);
+	arch_uprobe_copy_ixol(area->pages[0], 0, insns, insns_size);
 
 	if (!xol_add_vma(mm, area))
 		return area;
@@ -2123,7 +2133,7 @@ static struct return_instance *find_next_ret_chain(struct return_instance *ri)
 	return ri;
 }
 
-static void handle_trampoline(struct pt_regs *regs)
+void uprobe_handle_trampoline(struct pt_regs *regs)
 {
 	struct uprobe_task *utask;
 	struct return_instance *ri, *next;
@@ -2188,7 +2198,7 @@ static void handle_swbp(struct pt_regs *regs)
 	bp_vaddr = uprobe_get_swbp_addr(regs);
 
 	if (bp_vaddr == get_trampoline_vaddr())
-		return handle_trampoline(regs);
+		return uprobe_handle_trampoline(regs);
 
 	uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
 	if (!uprobe) {
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index faad00cce269..ddc954f28317 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -391,3 +391,5 @@ COND_SYSCALL(setuid16);
 
 /* restartable sequence */
 COND_SYSCALL(rseq);
+
+COND_SYSCALL(uprobe);
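ps: the numbers above are from the bench script mentioned earlier,
collected roughly like this (the release build matters here):

  $ cd tools/testing/selftests/bpf
  $ make RELEASE=1 -j$(nproc)
  $ ./benchs/run_bench_uprobes.sh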