On Mon, Jul 1, 2024 at 3:39 PM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
>
> This patch set, ultimately, switches global uprobes_treelock from RW spinlock
> to per-CPU RW semaphore, which has better performance and scales better under
> contention and multiple parallel threads triggering lots of uprobes.
>
> To make this work well with attaching multiple uprobes (through BPF
> multi-uprobe), we need to add batched versions of uprobe register/unregister
> APIs. This is what most of the patch set is actually doing. The actual switch
> to per-CPU RW semaphore is trivial after that and is done in the very last
> patch #12. See commit message with some comparison numbers.
>

Peter, I think I've addressed all the questions so far, but I wanted to take
a moment, bring all the discussions into a single place, summarize what I
think are the main points of contention, and hopefully make some progress, or
at least get us to a bit more constructive discussion where *both sides*
provide arguments. Right now there is a lot of "you are doing X, but why
don't you just do Y" with no argument for a) why X is bad/wrong/inferior and
b) why Y is better (and not just equivalent or, even worse, inferior). I
trust you have the best intentions in mind for this piece of kernel
infrastructure, as do I, so let's try to find a path forward.

1. Strategically, uprobes/uretprobes have to be improved. Customers complain
more and more that "uprobes are slow", justifiably so. Single-threaded
performance matters, but so does, critically, uprobes scalability. I.e., if
the kernel can handle N uprobes per second on a single uncontended CPU, then
triggering uprobes across M CPUs should, ideally and roughly, give us about
N * M total throughput. This doesn't seem controversial, but I wanted to make
it clear that this is the end goal of my work. And no, this patch set alone
doesn't, yet, get us there. But it's a necessary step, IMO. Jiri Olsa took on
single-threaded performance and is improving it with sys_uretprobe and soon
sys_uprobe; I'm looking into scalability and other smaller single-threaded
wins, where possible.

2. More tactically, RCU protection seems like the best way forward. We got
hung up on SRCU vs RCU Tasks Trace. Thanks to Paul, we also clarified that
RCU Tasks Trace has nothing to do with the Tasks Rude flavor (whatever that
is, I have no idea). Now, RCU Tasks Trace was specifically designed for the
lowest-overhead hot path (reader side) performance, at the expense of slowing
down the much rarer writers. My microbenchmarking does show at least a 5%
difference. Both flavors can handle sleepable uprobes waiting for page
faults. The Tasks Trace flavor is already used for tracing in the BPF realm,
including for sleepable uprobes, and it works well. It's not going away. Now,
you keep pushing for SRCU instead of RCU Tasks Trace, but I haven't seen a
single argument why. Please provide one, or let's stick to RCU Tasks Trace,
because the uprobe use case is an ideal fit for what the Tasks Trace flavor
was designed for.

3. Regardless of RCU flavor, due to RCU protection, we have to add batched
register/unregister APIs, so we can amortize the sync_rcu cost during
deregistration (see the rough sketch below for what I mean). Can we please
agree on that as well? This is the main goal of this patch set and I'd like
to land it before working further on changing and improving the rest of the
locking scheme. I won't be happy about it, but just to move things forward, I
can drop a) the custom refcounting and/or b) the percpu RW semaphore. Both
are beneficial, but not essential for the batched API work.
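To make points 2 and 3 above a bit more concrete, here is a very rough
sketch of the shape I have in mind. Function names like uprobe_lookup(),
remove_uprobe() and free_uprobe() are simplified placeholders, not the
actual code in the patches:

/* reader side (breakpoint hit): look up and run handlers under RCU
 * Tasks Trace protection (linux/rcupdate_trace.h) */
static void handle_bp_sketch(struct inode *inode, loff_t offset,
                             struct pt_regs *regs)
{
        struct uprobe *uprobe;

        rcu_read_lock_trace();
        uprobe = uprobe_lookup(inode, offset);  /* uprobes_tree lookup */
        if (uprobe)
                handler_chain(uprobe, regs);
        rcu_read_unlock_trace();
}

/* writer side: batched unregister pays for a single grace period over
 * the whole batch, instead of once per uprobe */
static void uprobes_unregister_batch_sketch(struct uprobe **uprobes, int cnt)
{
        int i;

        for (i = 0; i < cnt; i++)
                remove_uprobe(uprobes[i]);  /* unlink from uprobes_tree */

        synchronize_rcu_tasks_trace();      /* wait for all tracing readers */

        for (i = 0; i < cnt; i++)
                free_uprobe(uprobes[i]);
}

The details in the patches differ (and uretprobes still need refcounting,
see point 4 below), but this is the gist: the reader side is as cheap as
RCU Tasks Trace can make it, and the one synchronize_rcu_tasks_trace() is
amortized across the whole batch.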
But if you do force me to drop either of those, please state clearly your
reasons/arguments. No one has yet pointed out why the refcounting is broken
or why the percpu RW semaphore is bad. On the contrary, Ingo Molnar did
suggest the percpu RW semaphore in the first place (see [0]), but we
postponed it due to the lack of batched APIs, and promised to do this work.
Here I am, doing the promised work. Not purely because of the percpu RW
semaphore, but benefiting from it just as well.

  [0] https://lore.kernel.org/linux-trace-kernel/Zf+d9twfyIDosINf@xxxxxxxxx/

4. Another tactical thing, but an important one: the refcounting scheme for
uprobes. I've replied already, but I think refcounting is unavoidable for
uretprobes, and the current refcounting scheme is problematic for batched
APIs due to the race between finding a uprobe and there still being a
possibility that we'd need to undo all of that and retry again. I think the
main thing is to agree to change the refcounting to avoid this race, allowing
for simpler batched registration. Hopefully we can agree on that.

But there is also refcount_inc_not_zero(), which is another limiting factor
for scalability (see above about the end goal of scalability), vs the
atomic64_add()-based epoch+refcount approach I took, which is noticeably
better on x86-64 and, at the very least, doesn't hurt any other architecture
(see the rough sketch at the very end of this email). I think the latter
could be generalized as an alternative flavor of refcount_t, but I'd prefer
to land it in uprobes in its current shape, and if we think it's a good idea
to generalize, we can always do that refactoring once things stabilize a bit.

You seem to have problems with the refcounting implementation I did (besides
overflow detection, which I'll address in the next revision, so not a
problem). My arguments are a) performance and b) it's well contained within
the get/put helpers and doesn't leak outside of them *at all*, while
providing a nice, always-successful get_uprobe() primitive. Can I please hear
the arguments for not doing it, besides "everyone is using
refcount_inc_not_zero()", which isn't much of a reason (we'd never do
anything novel in the kernel if that were a good enough reason not to do
something new).

Again, thanks for the engagement, I do appreciate it. But let's try to move
this forward. Thanks!
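P.S. To make the epoch+refcount point a bit more concrete, here is a very
rough sketch of the "get" side of the idea. Field and constant names are
made up for illustration, and the put side, where the epoch comes into play
to detect a get racing with the final put, is where the subtlety lives and
is omitted here:

/* one atomic64_t packs a 32-bit epoch (upper half) and a 32-bit
 * refcount (lower half); names are illustrative only */
#define UPROBE_REF_EPOCH        (1LL << 32)
#define UPROBE_REF_CNT          (1LL)

/* always succeeds: a single unconditional add, no cmpxchg loop */
static struct uprobe *get_uprobe_sketch(struct uprobe *uprobe)
{
        atomic64_add(UPROBE_REF_EPOCH + UPROBE_REF_CNT, &uprobe->ref64);
        return uprobe;
}

/* vs the conventional refcount_t pattern, which can fail and forces the
 * caller to handle "uprobe is going away, redo the lookup and retry" */
static struct uprobe *try_get_uprobe_sketch(struct uprobe *uprobe)
{
        return refcount_inc_not_zero(&uprobe->ref) ? uprobe : NULL;
}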