This patch set, ultimately, switches the global uprobes_treelock from an RW
spinlock to a per-CPU RW semaphore, which has better performance and scales
better under contention with multiple parallel threads triggering lots of
uprobes. To make this work well with attaching multiple uprobes (through BPF
multi-uprobe), we need to add batched versions of the uprobe
register/unregister APIs. This is what most of the patch set is actually
doing. The actual switch to the per-CPU RW semaphore is trivial after that and
is done in the very last patch, #12. See its commit message for some
comparison numbers.

Patch #4 is probably the most important patch in the series, revamping uprobe
lifetime management and refcounting. See the patch description and added code
comments for all the details.

With the changes in patch #4, we open up the way to refactor the
uprobe_register() and uprobe_unregister() implementations in such a way that
we can avoid taking uprobes_treelock many times during a single batched
attachment/detachment. This allows us to accommodate the much higher latency
of taking a per-CPU RW semaphore for write. The end result of this patch set
is that attaching 50 thousand uprobes with BPF multi-uprobes doesn't regress
and takes about 200ms both before and after the changes in this patch set.

Patch #5 updates existing uprobe consumers to put all the necessary pieces
into struct uprobe_consumer, without having to pass around
offset/ref_ctr_offset. Existing consumers already keep this data around; we
just formalize the interface.

Patches #6 through #10 add batched versions of the register/unregister APIs
and gradually refactor them in such a way as to allow taking uprobes_treelock
only once per batch, splitting the logic into multiple independent phases.

Patch #11 switches BPF multi-uprobes to the batched uprobe APIs.

As mentioned, a very straightforward patch #12 takes advantage of all the prep
work and just switches uprobes_treelock to a per-CPU RW semaphore (a brief
illustrative sketch of the two locking primitives follows the diffstat below).

Andrii Nakryiko (12):
  uprobes: update outdated comment
  uprobes: grab write mmap lock in unapply_uprobe()
  uprobes: simplify error handling for alloc_uprobe()
  uprobes: revamp uprobe refcounting and lifetime management
  uprobes: move offset and ref_ctr_offset into uprobe_consumer
  uprobes: add batch uprobe register/unregister APIs
  uprobes: inline alloc_uprobe() logic into __uprobe_register()
  uprobes: split uprobe allocation and uprobes_tree insertion steps
  uprobes: batch uprobes_treelock during registration
  uprobes: improve lock batching for uprobe_unregister_batch
  uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes
  uprobes: switch uprobes_treelock to per-CPU RW semaphore

 include/linux/uprobes.h                            |  29 +-
 kernel/events/uprobes.c                            | 522 ++++++++++++------
 kernel/trace/bpf_trace.c                           |  40 +-
 kernel/trace/trace_uprobe.c                        |  53 +-
 .../selftests/bpf/bpf_testmod/bpf_testmod.c        |  23 +-
 5 files changed, 419 insertions(+), 248 deletions(-)

--
2.43.0
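
For context, here is a minimal, illustrative sketch (not code from these
patches) contrasting the read side of an RW spinlock with that of a per-CPU
RW semaphore; the treelock/treelock_rwsem names and the lookup helpers are
made up purely for illustration:

    /* Illustrative only; names are hypothetical, this is not the uprobes code. */
    #include <linux/spinlock.h>
    #include <linux/percpu-rwsem.h>

    static DEFINE_RWLOCK(treelock);             /* RW spinlock */
    DEFINE_STATIC_PERCPU_RWSEM(treelock_rwsem); /* per-CPU RW semaphore */

    static void lookup_with_rwlock(void)
    {
            /* readers on all CPUs update the same lock word, bouncing its cache line */
            read_lock(&treelock);
            /* ... tree lookup ... */
            read_unlock(&treelock);
    }

    static void lookup_with_percpu_rwsem(void)
    {
            /* common-case read side only touches per-CPU state, so parallel
             * readers don't contend on a shared cache line */
            percpu_down_read(&treelock_rwsem);
            /* ... tree lookup ... */
            percpu_up_read(&treelock_rwsem);
    }

The write side (percpu_down_write()/percpu_up_write()) is correspondingly much
more expensive than write_lock() on an RW spinlock, which is why the series
first batches registration/unregistration so the lock is taken for write only
once per batch.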