On Tue, Jul 2, 2024 at 12:19 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Tue, Jul 02, 2024 at 10:54:51AM -0700, Andrii Nakryiko wrote:
>
> > > @@ -593,6 +595,12 @@ static struct uprobe *get_uprobe(struct uprobe *uprobe)
> > >          return uprobe;
> > >  }
> > > [...]
> > > @@ -668,12 +677,25 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
> > >  static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > >  {
> > >          struct uprobe *uprobe;
> > > +        unsigned seq;
> > >
> > > -        read_lock(&uprobes_treelock);
> > > -        uprobe = __find_uprobe(inode, offset);
> > > -        read_unlock(&uprobes_treelock);
> > > +        guard(rcu)();
> > >
> > > -        return uprobe;
> > > +        do {
> > > +                seq = read_seqcount_begin(&uprobes_seqcount);
> > > +                uprobe = __find_uprobe(inode, offset);
> > > +                if (uprobe) {
> > > +                        /*
> > > +                         * Lockless RB-tree lookups are prone to false-negatives.
> > > +                         * If they find something, it's good. If they do not find,
> > > +                         * it needs to be validated.
> > > +                         */
> > > +                        return uprobe;
> > > +                }
> > > +        } while (read_seqcount_retry(&uprobes_seqcount, seq));
> > > +
> > > +        /* Really didn't find anything. */
> > > +        return NULL;
> > >  }
> >
> > Honest question here, as I don't understand the tradeoffs well enough.
> > Is there a lot of benefit to switching to seqcount lock vs using
> > percpu RW semaphore (previously recommended by Ingo). The latter is a
> > nice drop-in replacement and seems to be very fast and scale well.
>
> As you noted, that percpu-rwsem write side is quite insane. And you're
> creating this batch complexity to mitigate that.

Note that the batch API is needed regardless of whether we go with a
percpu RW semaphore or not.

As I mentioned, once uprobes_treelock contention is mitigated one way
or the other, the next bottleneck is uprobe->register_rwsem. For
scalability, we need to get rid of it and preferably not add any
locking at all. So tentatively I'd like to have lockless, RCU-protected
iteration over the uprobe->consumers list, calling consumer->handler()
along the way (rough sketch at the bottom of this email).

This means that on uprobe_unregister() we'd need synchronize_rcu() (for
whatever RCU flavor we end up using) to ensure that we don't free
uprobe_consumer memory from under handle_swbp() while it is actually
triggering consumers.

So without batched unregistration we'll be back to the same problem I'm
solving here: doing synchronize_rcu() for each attached uprobe, one by
one, is prohibitively slow. We went through this exercise with
ftrace/kprobes already and fixed it with batched APIs. Doing the same
for uprobes seems unavoidable as well.

> The patches you propose are quite complex, this alternative not so much.

I agree that this custom refcounting is not trivial, but at least it's
pretty well contained within two low-level helpers, both of which are
used only within this single .c file.

On the other hand, it gives us a) speed and better scalability (I
posted comparisons with the refcount_inc_not_zero approach earlier, I
believe) and b) simpler logic during registration (an even more
important aspect with the batched API), because we don't need to handle
a uprobe suddenly going away after we've already looked it up.

I believe that overall it's an improvement worth doing.

> > Right now we are bottlenecked on uprobe->register_rwsem (not
> > uprobes_treelock anymore), which is currently limiting the scalability
> > of uprobes and I'm going to work on that next once I'm done with this
> > series.
>
> Right, but it looks fairly simple to replace that rwsem with a mutex and
> srcu.
srcu vs. RCU Tasks Trace aside (which Paul already addressed), see
above about the need for a batched API and synchronize_rcu().
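
To make that more concrete, below is roughly the shape I have in mind
for the handler and unregistration paths. This is only a sketch under a
couple of assumptions, not code from this series: it pretends
uprobe->consumers has been converted to an RCU-safe list_head (today it
is a hand-rolled singly-linked list), the names cons_node and
uprobe_unregister_batch() are placeholders, and plain RCU stands in for
whichever flavor (srcu, RCU Tasks Trace) we end up picking.

/* Hot path, called from handle_swbp(): iterate consumers locklessly
 * under RCU protection, no register_rwsem and no refcount bumps. */
static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
{
        struct uprobe_consumer *uc;

        rcu_read_lock();
        list_for_each_entry_rcu(uc, &uprobe->consumers, cons_node) {
                if (uc->handler)
                        uc->handler(uc, regs); /* return-value handling elided */
        }
        rcu_read_unlock();
}

/* Batched unregistration: detach all consumers first, then pay for a
 * single grace period instead of one synchronize_rcu() per consumer. */
void uprobe_unregister_batch(struct uprobe **uprobes,
                             struct uprobe_consumer **ucs, int cnt)
{
        int i;

        for (i = 0; i < cnt; i++) {
                /* writer-side serialization stays, whatever form it takes */
                down_write(&uprobes[i]->register_rwsem);
                list_del_rcu(&ucs[i]->cons_node);
                up_write(&uprobes[i]->register_rwsem);
        }

        /*
         * One grace period covers all cnt consumers: once it elapses, no
         * handler_chain() can still be walking any of them, so callers
         * are free to release their uprobe_consumer memory.
         */
        synchronize_rcu();
}

Whatever the final API shape turns out to be, that last
synchronize_rcu() is the cost that has to be amortized across the whole
batch, which is why batched register/unregister is needed no matter how
uprobes_treelock itself ends up being protected.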