On Tue, Jul 2, 2024 at 12:19 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Tue, Jul 02, 2024 at 10:54:51AM -0700, Andrii Nakryiko wrote:
>
> > > @@ -593,6 +595,12 @@ static struct uprobe *get_uprobe(struct uprobe *uprobe)
> > >          return uprobe;
> > >  }
> > > [...]
> > > @@ -668,12 +677,25 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
> > >  static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > >  {
> > >          struct uprobe *uprobe;
> > > +        unsigned seq;
> > >
> > > -        read_lock(&uprobes_treelock);
> > > -        uprobe = __find_uprobe(inode, offset);
> > > -        read_unlock(&uprobes_treelock);
> > > +        guard(rcu)();
> > >
> > > -        return uprobe;
> > > +        do {
> > > +                seq = read_seqcount_begin(&uprobes_seqcount);
> > > +                uprobe = __find_uprobe(inode, offset);
> > > +                if (uprobe) {
> > > +                        /*
> > > +                         * Lockless RB-tree lookups are prone to false-negatives.
> > > +                         * If they find something, it's good. If they do not find,
> > > +                         * it needs to be validated.
> > > +                         */
> > > +                        return uprobe;
> > > +                }
> > > +        } while (read_seqcount_retry(&uprobes_seqcount, seq));
> > > +
> > > +        /* Really didn't find anything. */
> > > +        return NULL;
> > >  }
> >
> > Honest question here, as I don't understand the tradeoffs well enough.
> > Is there a lot of benefit to switching to seqcount lock vs using
> > percpu RW semaphore (previously recommended by Ingo). The latter is a
> > nice drop-in replacement and seems to be very fast and scale well.
>
> As you noted, that percpu-rwsem write side is quite insane. And you're
> creating this batch complexity to mitigate that.

Note that the batch API is needed regardless of whether we go with a
percpu RW semaphore or not.

As I mentioned, once uprobes_treelock contention is mitigated one way
or the other, the next bottleneck is uprobe->register_rwsem. For
scalability, we need to get rid of it and preferably not add any
locking at all. So tentatively I'd like to have lockless, RCU-protected
iteration over the uprobe->consumers list, calling consumer->handler()
along the way (rough sketch at the bottom of this email).

This means that on uprobe_unregister() we'd need synchronize_rcu() (for
whatever RCU flavor we end up using) to ensure that we don't free
uprobe_consumer memory from under handle_swbp() while it is actually
triggering consumers.

So without batched unregistration we'll be back to the same problem I'm
solving here: doing synchronize_rcu() for each attached uprobe, one by
one, is prohibitively slow. We went through this exercise with
ftrace/kprobes already and fixed it with batched APIs. Doing the same
for uprobes seems unavoidable as well.

> The patches you propose are quite complex, this alternative not so much.

I agree that this custom refcounting is not trivial, but at least it's
pretty well contained within two low-level helpers, both of which are
used only within this single .c file.

On the other hand, it gives us a) speed and better scalability (I
posted comparisons with the refcount_inc_not_zero approach earlier, I
believe) and b) simpler logic during registration (an even more
important aspect with the batched API), because we don't need to handle
a uprobe suddenly going away after we've already looked it up.

I believe that overall it's an improvement worth doing.

> > Right now we are bottlenecked on uprobe->register_rwsem (not
> > uprobes_treelock anymore), which is currently limiting the scalability
> > of uprobes and I'm going to work on that next once I'm done with this
> > series.
>
> Right, but it looks fairly simple to replace that rwsem with a mutex and
> srcu.
srcu vs. RCU Tasks Trace aside (which Paul already addressed), see
above about the need for a batched API and synchronize_rcu().
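
To make that more concrete, below is roughly the shape I have in mind
for the handler and unregistration paths. This is only a sketch under a
couple of assumptions, not code from this series: it pretends
uprobe->consumers has been converted to an RCU-safe list_head (today it
is a hand-rolled singly-linked list), the names cons_node and
uprobe_unregister_batch() are placeholders, and plain RCU stands in for
whichever flavor (srcu, RCU Tasks Trace) we end up picking.

/* Hot path, called from handle_swbp(): iterate consumers locklessly
 * under RCU protection, no register_rwsem and no refcount bumps. */
static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
{
        struct uprobe_consumer *uc;

        rcu_read_lock();
        list_for_each_entry_rcu(uc, &uprobe->consumers, cons_node) {
                if (uc->handler)
                        uc->handler(uc, regs); /* return-value handling elided */
        }
        rcu_read_unlock();
}

/* Batched unregistration: detach all consumers first, then pay for a
 * single grace period instead of one synchronize_rcu() per consumer. */
void uprobe_unregister_batch(struct uprobe **uprobes,
                             struct uprobe_consumer **ucs, int cnt)
{
        int i;

        for (i = 0; i < cnt; i++) {
                /* writer-side serialization stays, whatever form it takes */
                down_write(&uprobes[i]->register_rwsem);
                list_del_rcu(&ucs[i]->cons_node);
                up_write(&uprobes[i]->register_rwsem);
        }

        /*
         * One grace period covers all cnt consumers: once it elapses, no
         * handler_chain() can still be walking any of them, so callers
         * are free to release their uprobe_consumer memory.
         */
        synchronize_rcu();
}

Whatever the final API shape turns out to be, that last
synchronize_rcu() is the cost that has to be amortized across the whole
batch, which is why batched register/unregister is needed no matter how
uprobes_treelock itself ends up being protected.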