On Mon, Mar 25, 2024 at 12:12 PM Jonathan Haslam <jonathan.haslam@xxxxxxxxx> wrote:
>
> Hi Ingo,
>
> > > This change has been tested against production workloads that exhibit
> > > significant contention on the spinlock, and an almost order-of-magnitude
> > > reduction in mean uprobe execution time is observed (28 -> 3.5 microsecs).
> >
> > Have you considered/measured per-CPU RW semaphores?
>
> No, I hadn't, but thanks hugely for suggesting it! In initial measurements
> it seems to be between 20-100% faster than the RW spinlocks! Apologies for
> all the exclamation marks, but I'm very excited. I'll do some more testing
> tomorrow, but so far it's looking very good.
>

The documentation ([0]) says that locking for writing calls synchronize_rcu(); is that right? If so, attaching multiple uprobes (including just attaching a single BPF multi-uprobe) will take a really long time. We need to confirm we are not significantly regressing this. And if we are, we need to take measures in the BPF multi-uprobe attachment code path to make sure that a single multi-uprobe attachment is still fast.

If my worries above turn out to be true, it still feels like a good first step would be landing this patch as is (and getting it backported to older kernels), and then moving to a percpu rw-semaphore as the final (and a bit more invasive) solution (it's RCU-based, so it feels like a good primitive to settle on), while making sure not to regress multi-uprobes (we'll probably need some batched API for attaching multiple uprobes).

Thoughts?

  [0] https://docs.kernel.org/locking/percpu-rw-semaphore.html

> Thanks again for the input.
>
> Jon.
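
To make the trade-off concrete, here is a rough sketch (kernel-style C, not buildable on its own) of the percpu rw-semaphore read/write API from linux/percpu-rwsem.h. The lock name and the surrounding functions are hypothetical illustrations, not the actual uprobes code; the point is only that readers stay cheap while each writer pays an RCU grace period:

```c
/*
 * SKETCH ONLY -- illustrative, not the real uprobes patch.
 * The lock name "example_rwsem" and both functions are made up;
 * percpu_down_read()/percpu_down_write() are the real API.
 */
#include <linux/percpu-rwsem.h>

static DEFINE_STATIC_PERCPU_RWSEM(example_rwsem); /* hypothetical lock */

/* Hot path (e.g. a uprobe hit): readers bump a per-CPU count, so there
 * is no shared cache-line contention and this scales across CPUs. */
static void reader_fast_path(void)
{
	percpu_down_read(&example_rwsem);
	/* ... read-side work ... */
	percpu_up_read(&example_rwsem);
}

/* Slow path (e.g. attach/detach): the writer must wait for an RCU grace
 * period to flush all readers.  This is the per-attach latency concern
 * above -- N sequential attaches pay it N times unless a batched attach
 * API takes the write lock once for the whole batch. */
static void writer_slow_path(void)
{
	percpu_down_write(&example_rwsem);
	/* ... write-side update ... */
	percpu_up_write(&example_rwsem);
}
```

This is exactly the asymmetry the thread is weighing: near-free readers in exchange for expensive writers, which is a great fit for uprobe hits but a potential regression for bulk attachment.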