hi,
this patchset adds support for optimizing usdt probes placed on top of
a 5-byte nop instruction.

The generic approach (optimizing all uprobes) is hard, because it would
need to emulate possibly multiple original instructions, with all the
related issues. The usdt case, which stores a 5-byte nop, seems much
easier, so I'm starting with that.

The basic idea is to replace the breakpoint exception with a syscall,
which is faster on x86_64. For more details please see the changelog of
patch 8.

The run_bench_uprobes.sh benchmark triggers a uprobe (on top of different
original instructions) in a loop and counts how many hits happen per
second (the unit below is million loops per second).

There's a big speedup when you compare the current usdt implementation
(uprobe-nop) with the proposed one (uprobe-nop5):

  # ./benchs/run_bench_uprobes.sh

      usermode-count :  818.386 ± 1.886M/s
      syscall-count  :    8.923 ± 0.003M/s
  --> uprobe-nop     :    3.086 ± 0.005M/s
      uprobe-push    :    2.751 ± 0.001M/s
      uprobe-ret     :    1.481 ± 0.000M/s
  --> uprobe-nop5    :    4.016 ± 0.002M/s
      uretprobe-nop  :    1.712 ± 0.008M/s
      uretprobe-push :    1.616 ± 0.001M/s
      uretprobe-ret  :    1.052 ± 0.000M/s
      uretprobe-nop5 :    2.015 ± 0.000M/s

rfc v2 changes:
  - make uretprobe work properly with optimized uprobe
  - make the uprobe optimized code x86_64 specific [Peter]
  - rework the verify function logic, using it now as callback
  - fix find_nearest_page to include [PAGE_SIZE, ... ] area [Andrii]
  - try lockless vma lookup in in_uprobe_trampoline [Peter]
  - update instructions step by step (per partes) using int3, like in
    text_poke_bp_batch [David]
  - map uprobe trampoline via single global page [Thomas]
  - keep track of uprobes per mm_struct

pending todo (follow ups):
  - use PROCMAP_QUERY in tests
  - alloc 'struct uprobes_state' for mm_struct only when needed [Andrii]
  - seccomp change for new uprobe syscall (same as for uretprobe)

thanks,
jirka

---
Jiri Olsa (18):
      uprobes: Rename arch_uretprobe_trampoline function
      uprobes: Make copy_from_page global
      uprobes: Move ref_ctr_offset update out of uprobe_write_opcode
      uprobes: Add uprobe_write function
      uprobes: Add nbytes argument to uprobe_write_opcode
      uprobes: Add orig argument to uprobe_write and uprobe_write_opcode
      uprobes: Add swbp argument to arch_uretprobe_hijack_return_addr
      uprobes/x86: Add uprobe syscall to speed up uprobe
      uprobes/x86: Add mapping for optimized uprobe trampolines
      uprobes/x86: Add mm_uprobe objects to track uprobes within mm
      uprobes/x86: Add support to emulate nop5 instruction
      uprobes/x86: Add support to optimize uprobes
      selftests/bpf: Reorg the uprobe_syscall test function
      selftests/bpf: Use 5-byte nop for x86 usdt probes
      selftests/bpf: Add uprobe/usdt syscall tests
      selftests/bpf: Add hit/attach/detach race optimized uprobe test
      selftests/bpf: Add uprobe syscall sigill signal test
      selftests/bpf: Add 5-byte nop uprobe trigger bench

 arch/arm/probes/uprobes/core.c                              |   4 +-
 arch/arm64/kernel/probes/uprobes.c                          |   2 +-
 arch/csky/kernel/probes/uprobes.c                           |   2 +-
 arch/loongarch/kernel/uprobes.c                             |   2 +-
 arch/mips/kernel/uprobes.c                                  |   2 +-
 arch/powerpc/kernel/uprobes.c                               |   2 +-
 arch/riscv/kernel/probes/uprobes.c                          |   2 +-
 arch/s390/kernel/uprobes.c                                  |   2 +-
 arch/sparc/kernel/uprobes.c                                 |   2 +-
 arch/x86/entry/syscalls/syscall_64.tbl                      |   1 +
 arch/x86/include/asm/uprobes.h                              |   6 ++
 arch/x86/kernel/uprobes.c                                   | 530 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/syscalls.h                                    |   2 +
 include/linux/uprobes.h                                     |  23 +++-
 kernel/events/uprobes.c                                     | 147 +++++++++++++++++--------
 kernel/fork.c                                               |   1 +
 kernel/sys_ni.c                                             |   1 +
 tools/testing/selftests/bpf/bench.c                         |  12 +++
 tools/testing/selftests/bpf/benchs/bench_trigger.c          |  42 ++++++++
 tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh     |   2 +-
 tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c     | 342 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c |  34 +++++-
 tools/testing/selftests/bpf/sdt.h                           |   9 +-
 23 files changed, 1093 insertions(+), 79 deletions(-)
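
For reference, the relative gain behind the "big speed up" claim can be worked out directly from the benchmark numbers in the cover letter. The snippet below is only an illustrative sketch: the helper names (`speedup_pct`, `print_speedups`, `struct bench_pair`) are made up for this example and are not part of the patchset; the throughput values are the uprobe-nop/uprobe-nop5 and uretprobe-nop/uretprobe-nop5 rows from the run_bench_uprobes.sh output.

```c
#include <assert.h>
#include <stdio.h>

/* One benchmark pair, throughput in million loops per second. */
struct bench_pair {
	const char *name;
	double nop;	/* current usdt (…-nop row) */
	double nop5;	/* proposed usdt (…-nop5 row) */
};

/* Values copied from the run_bench_uprobes.sh output in the cover letter. */
static const struct bench_pair pairs[] = {
	{ "uprobe",    3.086, 4.016 },
	{ "uretprobe", 1.712, 2.015 },
};

/* Relative throughput gain of 'after' over 'before', in percent. */
double speedup_pct(double before, double after)
{
	assert(before > 0.0);
	return (after / before - 1.0) * 100.0;
}

/* Print the gain for every pair in the table above. */
void print_speedups(void)
{
	size_t i;

	for (i = 0; i < sizeof(pairs) / sizeof(pairs[0]); i++)
		printf("%-10s nop -> nop5: +%.1f%%\n", pairs[i].name,
		       speedup_pct(pairs[i].nop, pairs[i].nop5));
}
```

Calling `print_speedups()` from a small `main()` shows a gain of roughly 30% for the uprobe case and roughly 18% for the uretprobe case on the numbers quoted here; other machines will give different results.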