Re: [PATCH] uprobes: reduce contention on uprobes_tree access

Jonathan Haslam <jonathan.haslam@xxxxxxxxx> · Tue, 26 Mar 2024 11:55:12 +0000

> > > Have you considered/measured per-CPU RW semaphores?
> >
> > No I hadn't but thanks hugely for suggesting it! In initial measurements
> > it seems to be between 20-100% faster than the RW spinlocks! Apologies for
> > all the exclamation marks but I'm very excited. I'll do some more testing
> > tomorrow but so far it's looking very good.
> >
> 
> Documentation ([0]) says that locking for writing calls
> synchronize_rcu(), is that right? If that's true, attaching multiple
> uprobes (including just attaching a single BPF multi-uprobe) will take
> a really long time. We need to confirm we are not significantly
> regressing this. And if we do, we need to take measures in the BPF
> multi-uprobe attachment code path to make sure that a single
> multi-uprobe attachment is still fast.
> 
> If my worries above turn out to be true, it still feels like a good
> first step would be landing this patch as is (and getting it backported
> to older kernels), and then having percpu rw-semaphore as a final (and a
> bit more invasive) solution (it's RCU-based, so feels like a good
> primitive to settle on), making sure to not regress multi-uprobes
> (we'll probably need some batched API for multiple uprobes).
> 
> Thoughts?

Agreed. In the percpu_down_write() path we call rcu_sync_enter() which is
what calls into synchronize_rcu(). I haven't done the measurements yet but
I would imagine this has to regress probe attachment, at least in the
uncontended case. Of course, reads are by far the dominant mode here but
we probably shouldn't punish writes excessively. I will do some
measurements to quantify the write penalty here.

I agree that a batched interface for probe attachment is needed here. The
usual mode of operation for us is that we have a number of USDTs (uprobes)
in hand and we want to enable and disable them in one shot. Removing the
need to do multiple locking operations is definitely an efficiency
improvement that needs to be made. Tie that together with per-CPU RW
semaphores and this should scale extremely well in both the read and
write cases.

Jon.