On Tue, Jul 02, 2024 at 10:54:51AM -0700, Andrii Nakryiko wrote:

> > @@ -593,6 +595,12 @@ static struct uprobe *get_uprobe(struct uprobe *uprobe)
> >  	return uprobe;
> >  }
> >
> > +static void uprobe_free_rcu(struct rcu_head *rcu)
> > +{
> > +	struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
> > +	kfree(uprobe);
> > +}
> > +
> >  static void put_uprobe(struct uprobe *uprobe)
> >  {
> >  	if (refcount_dec_and_test(&uprobe->ref)) {
> > @@ -604,7 +612,8 @@ static void put_uprobe(struct uprobe *uprobe)
>
> right above this we have roughly this:
>
>   percpu_down_write(&uprobes_treelock);
>   /* refcount check */
>   rb_erase(&uprobe->rb_node, &uprobes_tree);
>   percpu_up_write(&uprobes_treelock);
>
> This writer lock is necessary for modification of the RB tree. And I
> was under the impression that I shouldn't be doing
> percpu_(down|up)_write() inside a normal
> rcu_read_lock()/rcu_read_unlock() region (percpu_down_write() has
> might_sleep() in it). But maybe I'm wrong; hopefully Paul can help to
> clarify.

Preemptible RCU or SRCU would work.

> But actually what's wrong with RCU Tasks Trace flavor?

Paul, isn't this the RCU flavour you created to deal with
!rcu_is_watching()? The flavour that never should have been created, in
favour of just cleaning up the mess instead of making more.

> I will ultimately use it anyway to avoid uprobe taking an unnecessary
> refcount and to protect uprobe->consumers iteration and uc->handler()
> calls, which could be sleepable, so would need rcu_read_lock_trace().

I don't think you need trace-RCU for that. SRCU would do nicely, I think.
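For concreteness, a rough kernel-style sketch of what the SRCU approach
could look like for the sleepable consumer walk. This is illustrative
only and not meant to compile as-is: `uprobes_srcu`, `register_mutex`,
and the exact field names are assumptions, not the actual series.

```
/* Hypothetical: a dedicated SRCU domain for uprobe consumers. */
DEFINE_STATIC_SRCU(uprobes_srcu);

static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
{
	struct uprobe_consumer *uc;
	int idx;

	/* SRCU read side: handlers are allowed to sleep in here. */
	idx = srcu_read_lock(&uprobes_srcu);
	for (uc = srcu_dereference(uprobe->consumers, &uprobes_srcu);
	     uc; uc = srcu_dereference(uc->next, &uprobes_srcu))
		if (uc->handler)
			uc->handler(uc, regs);
	srcu_read_unlock(&uprobes_srcu, idx);
}

static void consumer_del(struct uprobe *uprobe, struct uprobe_consumer *uc)
{
	/* Hypothetical mutex standing in for register_rwsem. */
	mutex_lock(&uprobe->register_mutex);
	/* ... unlink uc from the uprobe->consumers list ... */
	mutex_unlock(&uprobe->register_mutex);

	/* Wait out all (possibly sleeping) readers before freeing. */
	synchronize_srcu(&uprobes_srcu);
	/* uc is now safe to free */
}
```

The point of SRCU here is that, unlike plain RCU, its read-side critical
sections may sleep, which is what the sleepable uc->handler() calls need.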
> > 	mutex_lock(&delayed_uprobe_lock);
> > 	delayed_uprobe_remove(uprobe, NULL);
> > 	mutex_unlock(&delayed_uprobe_lock);
> > -	kfree(uprobe);
> > +
> > +	call_rcu(&uprobe->rcu, uprobe_free_rcu);
> >  	}
> >  }
> >
> > @@ -668,12 +677,25 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
> >  static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> >  {
> >  	struct uprobe *uprobe;
> > +	unsigned seq;
> >
> > -	read_lock(&uprobes_treelock);
> > -	uprobe = __find_uprobe(inode, offset);
> > -	read_unlock(&uprobes_treelock);
> > +	guard(rcu)();
> >
> > -	return uprobe;
> > +	do {
> > +		seq = read_seqcount_begin(&uprobes_seqcount);
> > +		uprobe = __find_uprobe(inode, offset);
> > +		if (uprobe) {
> > +			/*
> > +			 * Lockless RB-tree lookups are prone to false-negatives.
> > +			 * If they find something, it's good. If they do not find,
> > +			 * it needs to be validated.
> > +			 */
> > +			return uprobe;
> > +		}
> > +	} while (read_seqcount_retry(&uprobes_seqcount, seq));
> > +
> > +	/* Really didn't find anything. */
> > +	return NULL;
> > }
>
> Honest question here, as I don't understand the tradeoffs well enough.
> Is there a lot of benefit to switching to a seqcount lock vs using a
> percpu RW semaphore (previously recommended by Ingo)? The latter is a
> nice drop-in replacement and seems to be very fast and scale well.

As you noted, that percpu-rwsem write side is quite insane. And you're
creating this batch complexity to mitigate that.

The patches you propose are quite complex, this alternative not so much.

> Right now we are bottlenecked on uprobe->register_rwsem (not
> uprobes_treelock anymore), which is currently limiting the scalability
> of uprobes, and I'm going to work on that next once I'm done with this
> series.

Right, but it looks fairly simple to replace that rwsem with a mutex and
SRCU.