On Wed, Jul 03, 2024 at 02:33:06PM -0700, Andrii Nakryiko wrote: > 2. More tactically, RCU protection seems like the best way forward. We > got hung up on SRCU vs RCU Tasks Trace. Thanks to Paul, we also > clarified that RCU Tasks Trace has nothing to do with Tasks Rude > flavor (whatever that is, I have no idea). > > Now, RCU Tasks Trace were specifically designed for least overhead > hotpath (reader side) performance, at the expense of slowing down much > rarer writers. My microbenchmarking does show at least 5% difference. > Both flavors can handle sleepable uprobes waiting for page faults. > Tasks Trace flavor is already used for tracing in the BPF realm, > including for sleepable uprobes and works well. It's not going away. I need to look into this new RCU flavour and why it exists -- for example, why can't SRCU be improved to gain the same benefits. This is what we've always done, improve SRCU. > Now, you keep pushing for SRCU instead of RCU Tasks Trace, but I > haven't seen a single argument why. Please provide that, or let's > stick to RCU Tasks Trace, because uprobe's use case is an ideal case > of what Tasks Trace flavor was designed for. Because I actually know SRCU, and because it provides a local scope. It isolates the unregister waiters from other random users. I'm not going to use this funky new flavour until I truly understand it. Also, we actually want two scopes here, there is no reason for the consumer unreg to wait for the retprobe stuff. > 3. Regardless of RCU flavor, due to RCU protection, we have to add > batched register/unregister APIs, so we can amortize sync_rcu cost > during deregistration. Can we please agree on that as well? This is > the main goal of this patch set and I'd like to land it before working > further on changing and improving the rest of the locking schema. See my patch here: https://lkml.kernel.org/r/20240704084524.GC28838@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx I don't think it needs to be more complicated than that. > I won't be happy about it, but just to move things forward, I can drop > a) custom refcounting and/or b) percpu RW semaphore. Both are > beneficial but not essential for batched APIs work. But if you force > me to do that, please state clearly your reasons/arguments. The reason I'm pushing RCU here is because AFAICT uprobes doesn't actually need the stronger serialisation that rwlock (any flavour) provide. It is a prime candidate for RCU, and I think you'll find plenty papers / articles (by both Paul and others) that show that RCU scales better. As a bonus, you avoid that horrific write side cost that per-cpu rwsem has. The reason I'm not keen on that refcount thing was initially because I did not understand the justification for it, but worse, once I did read your justification, your very own numbers convinced me that the refcount is fundamentally problematic, in any way shape or form. > No one had yet pointed out why refcounting is broken Your very own numbers point out that refcounting is a problem here. > and why percpu RW semaphore is bad. Literature and history show us that RCU -- where possible -- is always better than any reader-writer locking scheme. > 4. Another tactical thing, but an important one. Refcounting schema > for uprobes. I've replied already, but I think refcounting is > unavoidable for uretprobes, I think we can fix that, I replied here: https://lkml.kernel.org/r/20240704083152.GQ11386@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx > and current refcounting schema is > problematic for batched APIs due to race between finding uprobe and > there still being a possibility we'd need to undo all that and retry > again. Right, I've not looked too deeply at that, because I've not seen a reason to actually change that. I can go think about it if you want, but meh.