Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore

Peter Zijlstra <peterz@xxxxxxxxxxxxx> · Thu, 4 Jul 2024 11:15:59 +0200

On Wed, Jul 03, 2024 at 02:33:06PM -0700, Andrii Nakryiko wrote:

> 2. More tactically, RCU protection seems like the best way forward. We
> got hung up on SRCU vs RCU Tasks Trace. Thanks to Paul, we also
> clarified that RCU Tasks Trace has nothing to do with Tasks Rude
> flavor (whatever that is, I have no idea).
> 
> Now, RCU Tasks Trace were specifically designed for least overhead
> hotpath (reader side) performance, at the expense of slowing down much
> rarer writers. My microbenchmarking does show at least 5% difference.
> Both flavors can handle sleepable uprobes waiting for page faults.
> Tasks Trace flavor is already used for tracing in the BPF realm,
> including for sleepable uprobes and works well. It's not going away.

I need to look into this new RCU flavour and why it exists -- for
example, why can't SRCU be improved to gain the same benefits. This is
what we've always done, improve SRCU.

> Now, you keep pushing for SRCU instead of RCU Tasks Trace, but I
> haven't seen a single argument why. Please provide that, or let's
> stick to RCU Tasks Trace, because uprobe's use case is an ideal case
> of what Tasks Trace flavor was designed for.

Because I actually know SRCU, and because it provides a local scope.
It isolates the unregister waiters from other random users. I'm not
going to use this funky new flavour until I truly understand it.

Also, we actually want two scopes here, there is no reason for the
consumer unreg to wait for the retprobe stuff.

> 3. Regardless of RCU flavor, due to RCU protection, we have to add
> batched register/unregister APIs, so we can amortize sync_rcu cost
> during deregistration. Can we please agree on that as well? This is
> the main goal of this patch set and I'd like to land it before working
> further on changing and improving the rest of the locking schema.

See my patch here:

  https://lkml.kernel.org/r/20240704084524.GC28838@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

I don't think it needs to be more complicated than that.

> I won't be happy about it, but just to move things forward, I can drop
> a) custom refcounting and/or b) percpu RW semaphore. Both are
> beneficial but not essential for batched APIs work. But if you force
> me to do that, please state clearly your reasons/arguments.

The reason I'm pushing RCU here is because AFAICT uprobes doesn't
actually need the stronger serialisation that rwlock (any flavour)
provide. It is a prime candidate for RCU, and I think you'll find plenty
papers / articles (by both Paul and others) that show that RCU scales
better.

As a bonus, you avoid that horrific write side cost that per-cpu rwsem
has.

The reason I'm not keen on that refcount thing was initially because I
did not understand the justification for it, but worse, once I did read
your justification, your very own numbers convinced me that the refcount
is fundamentally problematic, in any way shape or form.

> No one had yet pointed out why refcounting is broken 

Your very own numbers point out that refcounting is a problem here. 

> and why percpu RW semaphore is bad. 

Literature and history show us that RCU -- where possible -- is
always better than any reader-writer locking scheme.

> 4. Another tactical thing, but an important one. Refcounting schema
> for uprobes. I've replied already, but I think refcounting is
> unavoidable for uretprobes,

I think we can fix that, I replied here:

  https://lkml.kernel.org/r/20240704083152.GQ11386@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

> and current refcounting schema is
> problematic for batched APIs due to race between finding uprobe and
> there still being a possibility we'd need to undo all that and retry
> again.

Right, I've not looked too deeply at that, because I've not seen a
reason to actually change that. I can go think about it if you want, but
meh.