> > Things to note about the results:
> >
> > - The results are slightly variable, so don't get too caught up on
> >   individual thread counts - it's the trend that is important.
> > - In terms of throughput with this specific benchmark, a *very* macro
> >   view is that the RW spinlock provides 40-60% more throughput than
> >   the spinlock, and the per-CPU RW semaphore provides on the order of
> >   50-100% more throughput than the spinlock.
> > - This doesn't fully reflect the large reduction in latency that we
> >   have seen in application-based measurements. However, it does
> >   demonstrate that even the trivial change of going to an RW spinlock
> >   provides significant benefits.
>
> This is probably because trig-uprobe-nop creates a single uprobe that
> is triggered on many CPUs, while in production we also have *many*
> uprobes running on many CPUs. In this benchmark, besides contention on
> uprobes_treelock, we are also hammering on other per-uprobe locks
> (register_rwsem, and, if you don't have the [0] patch applied locally,
> another filter lock taken on each hit, filter->rwlock). There is also
> atomic refcounting going on, which, when every CPU hits the same
> uprobe at the same time, causes a bunch of cache line bouncing.
>
> So yes, it's understandable that in practice, in production, you see
> an even larger effect from optimizing uprobes_treelock than in this
> micro-benchmark.
>
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=probes/for-next&id=366f7afd3de31d3ce2f4cbff97c6c23b6aa6bcdf

Thanks for the reply and the thoughts on this, Andrii. Yes, I do have
the filter->rwlock fix applied but, as you say, there are no doubt
other effects at play here, as is to be expected in such a synthetic
workload. I'm pleased with the outcomes though, as they show a good
result even if it is at the lower end of what I expected. The results
also show that pursuing an RCU solution is definitely worth it, but
that write penalty is brutal in the case of a full synchronize_rcu()!
Should be fun. (Rough sketches of both the rwlock read path and the
RCU write-side cost are at the end of this mail.)

> > for num_threads in {1..20}
> > do
> >     sudo ./bench -p $num_threads trig-uprobe-nop | grep Summary
>
> Just want to mention the -a (affinity) option that you can pass to
> the bench tool; it will pin each thread on its own CPU. It generally
> makes tests more uniform, eliminating CPU-migration variability.

Thanks for pointing that flag out! (Adjusted invocation below as well.)

Jon.

> > done
> >
> >
> > spinlock
> >
> > Summary: hits 1.453 ± 0.005M/s ( 1.453M/prod)
> > Summary: hits 2.087 ± 0.005M/s ( 1.043M/prod)
> > Summary: hits 2.701 ± 0.012M/s ( 0.900M/prod)
>
> I also wanted to point out that the first measurement (1.453M/s in
> the first row) is the total throughput across all threads, while the
> value in parentheses (0.900M/prod in the three-thread row) is the
> average throughput per thread, i.e. the total divided by the thread
> count. So this M/prod value is the most interesting one in this
> benchmark, where we assess the effect of reducing contention.
>
> > Summary: hits 1.917 ± 0.011M/s ( 0.479M/prod)
> > Summary: hits 2.105 ± 0.003M/s ( 0.421M/prod)
> > Summary: hits 1.615 ± 0.006M/s ( 0.269M/prod)
>
> [...]
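
P.S. A few concrete bits for the archives. First, the benchmark loop
with the affinity flag folded in, which is what I'll use for the next
run (assuming -a takes no argument - I'm going from Andrii's
description rather than from bench's usage text):

for num_threads in {1..20}
do
    sudo ./bench -a -p $num_threads trig-uprobe-nop | grep Summary
done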
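
Second, for anyone reading along, this is roughly the shape of the
"trivial change" measured above. A sketch against
kernel/events/uprobes.c rather than the actual diff: __find_uprobe()
does the real rb-tree walk, and the refcounting on the returned uprobe
is elided here.

/*
 * Sketch only, not the actual patch: the hot lookup path merely reads
 * the rb-tree, so with an rwlock_t concurrent lookups stop serialising
 * against each other and only serialise against register/unregister,
 * which take write_lock(&uprobes_treelock).
 */
static DEFINE_RWLOCK(uprobes_treelock);        /* was DEFINE_SPINLOCK() */

static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
{
        struct uprobe *uprobe;

        read_lock(&uprobes_treelock);          /* was spin_lock() */
        uprobe = __find_uprobe(inode, offset); /* rb-tree walk */
        read_unlock(&uprobes_treelock);        /* was spin_unlock() */

        return uprobe;
}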
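
And this is where I expect the RCU write penalty to show up: reads go
lockless, but every delete then sits out a full grace period before the
memory can be reused. Again only a sketch, hand-waving over the fact
that plain rb-tree rotations are not RCU-safe as-is (so the real thing
needs something like a seqcount-with-retry or a different structure)
and over the uprobe refcounting:

static struct uprobe *find_uprobe_rcu(struct inode *inode, loff_t offset)
{
        struct uprobe *uprobe;

        rcu_read_lock();        /* no treelock at all on the read side */
        uprobe = __find_uprobe(inode, offset);
        rcu_read_unlock();

        return uprobe;
}

static void delete_uprobe(struct uprobe *uprobe)
{
        /* uprobes_tree is the global rb_root from uprobes.c */
        write_lock(&uprobes_treelock);
        rb_erase(&uprobe->rb_node, &uprobes_tree);
        write_unlock(&uprobes_treelock);

        synchronize_rcu();      /* the brutal part: one full grace */
        kfree(uprobe);          /* period per delete before freeing */
}

(Something like kfree_rcu() or SRCU would trade that blocking cost for
deferred freeing, which is part of what should make this fun.)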