On Mon, Jul 8, 2024 at 3:56 PM Masami Hiramatsu <mhiramat@xxxxxxxxxx> wrote:
>
> On Mon, 08 Jul 2024 11:12:41 +0200
> Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> > Hi!
> >
> > These patches implement the (S)RCU based proposal to optimize uprobes.
> >
> > On my c^Htrusty old IVB-EP -- where each (of the 40) CPU calls 'func'
> > in a tight loop:
> >
> >   perf probe -x ./uprobes test=func
> >   perf stat -ae probe_uprobe:test -- sleep 1
> >
> >   perf probe -x ./uprobes test=func%return
> >   perf stat -ae probe_uprobe:test__return -- sleep 1
> >
> > PRE:
> >
> >   4,038,804      probe_uprobe:test
> >   2,356,275      probe_uprobe:test__return
> >
> > POST:
> >
> >   7,216,579      probe_uprobe:test
> >   6,744,786      probe_uprobe:test__return
> >
>
> Good results! So this is another series of Andrii's batch register?
> (but maybe it becomes simpler)

Yes, this would be an alternative to my patches.

Peter, I didn't have time to look at the patches just yet, but I
managed to run a quick benchmark (using the bench tool we have as part
of BPF selftests) to see both single-threaded performance and how the
performance scales with CPUs, now that we are not bottlenecked on
register_rwsem. Here are some results:

  [root@kerneltest003.10.atn6 ~]# for num_threads in {1..20}; do \
        ./bench -a -d10 -p $num_threads trig-uprobe-nop | grep Summary; done
  Summary: hits    3.278 ± 0.021M/s (  3.278M/prod)
  Summary: hits    4.364 ± 0.005M/s (  2.182M/prod)
  Summary: hits    6.517 ± 0.011M/s (  2.172M/prod)
  Summary: hits    8.203 ± 0.004M/s (  2.051M/prod)
  Summary: hits    9.520 ± 0.012M/s (  1.904M/prod)
  Summary: hits    8.316 ± 0.007M/s (  1.386M/prod)
  Summary: hits    7.893 ± 0.037M/s (  1.128M/prod)
  Summary: hits    8.490 ± 0.014M/s (  1.061M/prod)
  Summary: hits    8.022 ± 0.005M/s (  0.891M/prod)
  Summary: hits    8.471 ± 0.019M/s (  0.847M/prod)
  Summary: hits    8.156 ± 0.021M/s (  0.741M/prod)
  ...

(The numbers in the first column are total throughput; xxx/prod is
per-thread throughput.)

Single-threaded performance (about 3.3 mln/s) is on par with what I
had with my patches. And it clearly scales better with more threads
now that register_rwsem is gone, though, unfortunately, it doesn't
really scale linearly.

Quick profiling of the 8-threaded benchmark shows that we spend >20%
of CPU in mmap_read_lock/mmap_read_unlock in find_active_uprobe (a
simplified sketch of that lookup path is appended after the quoted
text below). I think that's what prevents uprobes from scaling
linearly; if you have some good ideas on how to get rid of it, that
would be extremely beneficial. We also spend about 14% of the time in
srcu_read_lock() (see the generic SRCU pattern sketched at the very
end). The rest is interrupt handling overhead, the actual user-space
function overhead, and uprobe_dispatcher() calls.

Ramping this up to 16 threads shows that mmap_rwsem gets even more
costly, up to 45% of CPU, while SRCU grows more slowly, to 19% of
CPU. Is this expected? (I'm not familiar with the implementation
details.)

P.S. Would you be able to rebase your patches on top of the latest
probes/for-next, which includes Jiri's sys_uretprobe changes? Right
now the uretprobe benchmarks are quite unrepresentative because of
that.

Thanks!

> Thank you,
>
> >
> > Patches also available here:
> >
> >   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/uprobes
> >
>
> --
> Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx>
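
For reference, the lookup behind those mmap_read_lock numbers looks
roughly like the sketch below. It is modeled on find_active_uprobe()
in kernel/events/uprobes.c, heavily simplified (the is_swbp/error
handling is dropped), so treat it as an approximation rather than the
exact source:

/*
 * Simplified sketch of the per-hit lookup, modeled on
 * find_active_uprobe(); not the exact source.
 */
static struct uprobe *find_active_uprobe(unsigned long bp_vaddr)
{
	struct mm_struct *mm = current->mm;
	struct uprobe *uprobe = NULL;
	struct vm_area_struct *vma;

	/* Taken by every thread that hits any uprobe in this process. */
	mmap_read_lock(mm);
	vma = vma_lookup(mm, bp_vaddr);
	if (vma && valid_vma(vma, false /* !is_register */)) {
		struct inode *inode = file_inode(vma->vm_file);
		loff_t offset = vaddr_to_offset(vma, bp_vaddr);

		/* rb-tree lookup of the registered uprobe */
		uprobe = find_uprobe(inode, offset);
	}
	mmap_read_unlock(mm);

	return uprobe;
}

Every breakpoint hit takes mm's read lock just to translate the
trapping address into an <inode, offset> pair, so all threads hitting
probes in the same process keep bouncing the rwsem's cacheline even
though none of them writes to the address space.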
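
And for the srcu_read_lock() portion, the generic read-side pattern
involved is sketched below. The types and names here are illustrative
only, not taken from Peter's patches (in the tree at the time,
consumers were a singly-linked list guarded by register_rwsem):

/* Illustrative types only -- the real kernel structs differ. */
struct uprobe_consumer {
	int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
	struct list_head cons_node;
};

struct uprobe {
	struct list_head consumers;
	/* ... */
};

DEFINE_STATIC_SRCU(uprobes_srcu);

/* Hot path: handlers run inside an SRCU read-side critical section. */
static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
{
	struct uprobe_consumer *uc;
	int idx;

	idx = srcu_read_lock(&uprobes_srcu);
	list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
				 srcu_read_lock_held(&uprobes_srcu))
		uc->handler(uc, regs);
	srcu_read_unlock(&uprobes_srcu, idx);
}

/* Slow path: unregistration unlinks the consumer, then waits for all
 * in-flight readers instead of locking them out. */
static void consumer_del(struct uprobe *uprobe, struct uprobe_consumer *uc)
{
	list_del_rcu(&uc->cons_node);
	synchronize_srcu(&uprobes_srcu);
}

The trade being made: the hot path does only per-CPU counter
operations in srcu_read_lock()/srcu_read_unlock(), while
unregistration pays with a synchronize_srcu() grace-period wait.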