On Wed, Apr 3, 2024 at 4:05 AM Jonathan Haslam <jonathan.haslam@xxxxxxxxx> wrote:
>
> > > > > Given the discussion around per-cpu rw semaphore and need for
> > > > > (internal) batched attachment API for uprobes, do you think you can
> > > > > apply this patch as is for now? We can then gain initial improvements
> > > > > in scalability that are also easy to backport, and Jonathan will work
> > > > > on a more complete solution based on per-cpu RW semaphore, as
> > > > > suggested by Ingo.
> > > >
> > > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe.
> > > > I would like to wait for the next version.
> > >
> > > My initial tests show a nice improvement over the RW spinlocks but a
> > > significant regression in acquiring a write lock. I've got a few days of
> > > vacation over Easter, but I'll aim to get some more formalised results
> > > out to the thread toward the end of next week.
> >
> > As long as the write lock is only on the cold path, I think you can
> > choose the per-cpu RW semaphore. Since it does not busy-wait, the total
> > system performance impact will be small.
> > I look forward to your formalized results :)
>
> Sorry for the delay in getting back to you on this, Masami.
>
> I have used one of the BPF selftest benchmarks to provide some form of
> comparison of the three different approaches (spinlock, RW spinlock and
> per-CPU RW semaphore). The benchmark used here is 'trig-uprobe-nop',
> which just executes a single uprobe with a minimal BPF program attached.
> The tests were done on a 32-core qemu/kvm instance.

Thanks a lot for running benchmarks and providing results!

> Things to note about the results:
>
> - The results are slightly variable, so don't get too caught up on
>   individual thread counts - it's the trend that is important.
> - In terms of throughput with this specific benchmark, a *very* macro
>   view is that the RW spinlock provides 40-60% more throughput than the
>   spinlock.
> - The per-CPU RW semaphore provides on the order of 50-100% more
>   throughput than the spinlock.
> - This doesn't fully reflect the large reduction in latency that we have
>   seen in application-based measurements. However, it does demonstrate
>   that even the trivial change of going to a RW spinlock provides
>   significant benefits.

This is probably because trig-uprobe-nop creates a single uprobe that is
triggered on many CPUs, while in production we also have *many* uprobes
running on many CPUs. In this benchmark, besides contention on
uprobes_treelock, we are also hammering on other per-uprobe locks
(register_rwsem; also, if you don't have the [0] patch locally, another
filter lock, filter->rwlock, is taken each time). There is also atomic
refcounting going on, which, when the same uprobe is hit on all CPUs at
the same time, will cause a bunch of cache line bouncing.

So yes, it's understandable that in practice in production you see an even
larger effect from optimizing uprobes_treelock than in this
micro-benchmark.

  [0] https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=probes/for-next&id=366f7afd3de31d3ce2f4cbff97c6c23b6aa6bcdf

> I haven't included the measurements of per-CPU RW semaphore write
> performance as they are completely in line with those that Paul McKenney
> posted on his journal [0]. On a 32-core system I see semaphore writes
> take on the order of 25-28 milliseconds - the cost of the
> synchronize_rcu().
>
> Each block of results below shows one line per execution of the
> benchmark (the "Summary" line), and each line is a run with one more
> thread added - a thread is a "producer". The lines are edited to remove
> extraneous output that adds no value here.
>
> The tests were executed with this driver script:
>
> for num_threads in {1..20}
> do
>   sudo ./bench -p $num_threads trig-uprobe-nop | grep Summary

Just want to mention the -a (affinity) option that you can pass to the
bench tool; it will pin each thread on its own CPU.
It generally makes tests more uniform, eliminating CPU migration
variability.

> done
>
>
> spinlock
>
> Summary: hits 1.453 ± 0.005M/s ( 1.453M/prod)
> Summary: hits 2.087 ± 0.005M/s ( 1.043M/prod)
> Summary: hits 2.701 ± 0.012M/s ( 0.900M/prod)

I also want to point out that the first measurement (e.g. 2.701M/s in the
last row above) is the total throughput across all threads, while the
value in parentheses (0.900M/prod) is the average throughput per thread.
So this M/prod value is the most interesting one in this benchmark, where
we assess the effect of reducing contention.

> Summary: hits 1.917 ± 0.011M/s ( 0.479M/prod)
> Summary: hits 2.105 ± 0.003M/s ( 0.421M/prod)
> Summary: hits 1.615 ± 0.006M/s ( 0.269M/prod)

[...]
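
To make the relationship between the two Summary numbers concrete, the
M/prod value is just the total throughput divided by the number of
producer threads in that run. A small illustrative calculation (not part
of the bench tool; figures are the first three spinlock rows quoted
above):

```python
# Per-producer throughput = total hits/s divided by producer count.
# (producers, total M/s) pairs taken from the spinlock results above.
runs = [(1, 1.453), (2, 2.087), (3, 2.701)]

for producers, total in runs:
    print(f"{producers} producers: {total / producers:.3f} M/prod")
```

The printed per-producer values (1.453, 1.043, 0.900) match the
parenthesised M/prod column in the results.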
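
For background on the read/write asymmetry discussed above: the kernel's
per-CPU RW semaphore (percpu_down_read()/percpu_down_write()) makes the
read path cheap by keeping reader state per CPU, while a writer must
synchronize with every CPU, which is where the synchronize_rcu() cost
comes in. A rough userspace sketch of that asymmetry using a "big-reader"
array of per-CPU locks (illustrative only - this is not the kernel
implementation, and all names here are hypothetical):

```python
import threading

NCPU = 4

# One lock per "CPU": readers touch only their own lock, so readers on
# different CPUs never contend; a writer must sweep all of them.
cpu_locks = [threading.Lock() for _ in range(NCPU)]

def read_lock(cpu):
    cpu_locks[cpu].acquire()      # cheap: one local lock

def read_unlock(cpu):
    cpu_locks[cpu].release()

def write_lock():
    for lk in cpu_locks:          # expensive: O(NCPU) acquisitions
        lk.acquire()

def write_unlock():
    for lk in reversed(cpu_locks):
        lk.release()

counter = 0

read_lock(0)                      # read side: touches only CPU 0's lock
snapshot = counter
read_unlock(0)

write_lock()                      # write side: sweeps every per-CPU lock
counter += 1
write_unlock()

print(snapshot, counter)          # prints: 0 1
```

This matches the benchmark picture: the hot (read) path scales across
CPUs, while the cold (write/registration) path pays a per-CPU sweep - in
the real kernel implementation, an RCU grace period on the order of tens
of milliseconds.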