On Mon, Jul 8, 2024 at 3:56 PM Masami Hiramatsu <mhiramat@xxxxxxxxxx> wrote:
>
> On Mon, 08 Jul 2024 11:12:41 +0200
> Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> > Hi!
> >
> > These patches implement the (S)RCU based proposal to optimize uprobes.
> >
> > On my c^Htrusty old IVB-EP -- where each (of the 40) CPU calls 'func'
> > in a tight loop:
> >
> >   perf probe -x ./uprobes test=func
> >   perf stat -ae probe_uprobe:test -- sleep 1
> >
> >   perf probe -x ./uprobes test=func%return
> >   perf stat -ae probe_uprobe:test__return -- sleep 1
> >
> > PRE:
> >
> >   4,038,804      probe_uprobe:test
> >   2,356,275      probe_uprobe:test__return
> >
> > POST:
> >
> >   7,216,579      probe_uprobe:test
> >   6,744,786      probe_uprobe:test__return
> >
>
> Good results! So this is another series of Andrii's batch register?
> (but maybe it becomes simpler)

Yes, this would be an alternative to my patches.

Peter, I didn't have time to look at the patches just yet, but I
managed to run a quick benchmark (using the bench tool we have as part
of BPF selftests) to see both single-threaded performance and how the
performance scales with CPUs, now that we are not bottlenecked on
register_rwsem. Here are some results:

  [root@kerneltest003.10.atn6 ~]# for num_threads in {1..20}; do \
        ./bench -a -d10 -p $num_threads trig-uprobe-nop | grep Summary; done
  Summary: hits    3.278 ± 0.021M/s (  3.278M/prod)
  Summary: hits    4.364 ± 0.005M/s (  2.182M/prod)
  Summary: hits    6.517 ± 0.011M/s (  2.172M/prod)
  Summary: hits    8.203 ± 0.004M/s (  2.051M/prod)
  Summary: hits    9.520 ± 0.012M/s (  1.904M/prod)
  Summary: hits    8.316 ± 0.007M/s (  1.386M/prod)
  Summary: hits    7.893 ± 0.037M/s (  1.128M/prod)
  Summary: hits    8.490 ± 0.014M/s (  1.061M/prod)
  Summary: hits    8.022 ± 0.005M/s (  0.891M/prod)
  Summary: hits    8.471 ± 0.019M/s (  0.847M/prod)
  Summary: hits    8.156 ± 0.021M/s (  0.741M/prod)
  ...

(The numbers in the first column are total throughput; xxx/prod is
per-thread throughput.)

Single-threaded performance (about 3.3 mln/s) is on par with what I
had with my patches. And it clearly scales better with more threads
now that register_rwsem is gone, though, unfortunately, it doesn't
really scale linearly.

Quick profiling of the 8-threaded benchmark shows that we spend >20%
of CPU in mmap_read_lock/mmap_read_unlock in find_active_uprobe (a
simplified sketch of that lookup path is appended after the quoted
text below). I think that's what prevents uprobes from scaling
linearly; if you have some good ideas on how to get rid of it, that
would be extremely beneficial. We also spend about 14% of the time in
srcu_read_lock() (see the generic SRCU pattern sketched at the very
end). The rest is interrupt handling overhead, the actual user-space
function overhead, and uprobe_dispatcher() calls.

Ramping this up to 16 threads shows that mmap_rwsem gets even more
costly, up to 45% of CPU, while SRCU grows more slowly, to 19% of
CPU. Is this expected? (I'm not familiar with the implementation
details.)

P.S. Would you be able to rebase your patches on top of the latest
probes/for-next, which includes Jiri's sys_uretprobe changes? Right
now the uretprobe benchmarks are quite unrepresentative because of
that.

Thanks!

> Thank you,
>
> >
> > Patches also available here:
> >
> >   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/uprobes
> >
>
> --
> Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx>
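
For reference, the lookup behind those mmap_read_lock numbers looks
roughly like the sketch below. It is modeled on find_active_uprobe()
in kernel/events/uprobes.c, heavily simplified (the is_swbp/error
handling is dropped), so treat it as an approximation rather than the
exact source:

/*
 * Simplified sketch of the per-hit lookup, modeled on
 * find_active_uprobe(); not the exact source.
 */
static struct uprobe *find_active_uprobe(unsigned long bp_vaddr)
{
	struct mm_struct *mm = current->mm;
	struct uprobe *uprobe = NULL;
	struct vm_area_struct *vma;

	/* Taken by every thread that hits any uprobe in this process. */
	mmap_read_lock(mm);
	vma = vma_lookup(mm, bp_vaddr);
	if (vma && valid_vma(vma, false /* !is_register */)) {
		struct inode *inode = file_inode(vma->vm_file);
		loff_t offset = vaddr_to_offset(vma, bp_vaddr);

		/* rb-tree lookup of the registered uprobe */
		uprobe = find_uprobe(inode, offset);
	}
	mmap_read_unlock(mm);

	return uprobe;
}

Every breakpoint hit takes mm's read lock just to translate the
trapping address into an <inode, offset> pair, so all threads hitting
probes in the same process keep bouncing the rwsem's cacheline even
though none of them writes to the address space.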
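
And for the srcu_read_lock() portion, the generic read-side pattern
involved is sketched below. The types and names here are illustrative
only, not taken from Peter's patches (in the tree at the time,
consumers were a singly-linked list guarded by register_rwsem):

/* Illustrative types only -- the real kernel structs differ. */
struct uprobe_consumer {
	int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
	struct list_head cons_node;
};

struct uprobe {
	struct list_head consumers;
	/* ... */
};

DEFINE_STATIC_SRCU(uprobes_srcu);

/* Hot path: handlers run inside an SRCU read-side critical section. */
static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
{
	struct uprobe_consumer *uc;
	int idx;

	idx = srcu_read_lock(&uprobes_srcu);
	list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
				 srcu_read_lock_held(&uprobes_srcu))
		uc->handler(uc, regs);
	srcu_read_unlock(&uprobes_srcu, idx);
}

/* Slow path: unregistration unlinks the consumer, then waits for all
 * in-flight readers instead of locking them out. */
static void consumer_del(struct uprobe *uprobe, struct uprobe_consumer *uc)
{
	list_del_rcu(&uc->cons_node);
	synchronize_srcu(&uprobes_srcu);
}

The trade being made: the hot path does only per-CPU counter
operations in srcu_read_lock()/srcu_read_unlock(), while
unregistration pays with a synchronize_srcu() grace-period wait.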