On Mon, Jul 1, 2024 at 3:39 PM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
>
> This patch set, ultimately, switches global uprobes_treelock from RW spinlock
> to per-CPU RW semaphore, which has better performance and scales better under
> contention and multiple parallel threads triggering lots of uprobes.
>
> To make this work well with attaching multiple uprobes (through BPF
> multi-uprobe), we need to add batched versions of uprobe register/unregister
> APIs. This is what most of the patch set is actually doing. The actual switch
> to per-CPU RW semaphore is trivial after that and is done in the very last
> patch #12. See commit message with some comparison numbers.
>

Peter, I think I've addressed all the questions so far, but I wanted to take
a moment, bring all the discussions into a single place, summarize what I
think are the main points of contention, and hopefully make some progress, or
at least get us to a bit more constructive discussion where *both sides*
provide arguments. Right now there is a lot of "you are doing X, but why
don't you just do Y" with no argument for a) why X is bad/wrong/inferior and
b) why Y is better (and not just equivalent or, even worse, inferior). I
trust you have the best intentions in mind for this piece of kernel
infrastructure, as do I, so let's try to find a path forward.

1. Strategically, uprobes/uretprobes have to be improved. Customers complain
more and more that "uprobes are slow", justifiably so. Single-threaded
performance matters, but so does, critically, uprobes scalability. I.e., if
the kernel can handle N uprobes per second on a single uncontended CPU, then
triggering uprobes across M CPUs should, ideally and roughly, give us about
N * M total throughput. This doesn't seem controversial, but I wanted to make
it clear that this is the end goal of my work. And no, this patch set alone
doesn't, yet, get us there. But it's a necessary step, IMO. Jiri Olsa took on
single-threaded performance and is improving it with sys_uretprobe and soon
sys_uprobe; I'm looking into scalability and other smaller single-threaded
wins, where possible.

2. More tactically, RCU protection seems like the best way forward. We got
hung up on SRCU vs RCU Tasks Trace. Thanks to Paul, we also clarified that
RCU Tasks Trace has nothing to do with the Tasks Rude flavor (whatever that
is, I have no idea). Now, RCU Tasks Trace was specifically designed for the
lowest-overhead hot path (reader side) performance, at the expense of slowing
down the much rarer writers. My microbenchmarking does show at least a 5%
difference. Both flavors can handle sleepable uprobes waiting for page
faults. The Tasks Trace flavor is already used for tracing in the BPF realm,
including for sleepable uprobes, and it works well. It's not going away. Now,
you keep pushing for SRCU instead of RCU Tasks Trace, but I haven't seen a
single argument why. Please provide one, or let's stick to RCU Tasks Trace,
because the uprobe use case is an ideal fit for what the Tasks Trace flavor
was designed for.

3. Regardless of RCU flavor, due to RCU protection, we have to add batched
register/unregister APIs, so we can amortize the sync_rcu cost during
deregistration (see the rough sketch below for what I mean). Can we please
agree on that as well? This is the main goal of this patch set and I'd like
to land it before working further on changing and improving the rest of the
locking scheme. I won't be happy about it, but just to move things forward, I
can drop a) the custom refcounting and/or b) the percpu RW semaphore. Both
are beneficial, but not essential for the batched API work.
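To make points 2 and 3 above a bit more concrete, here is a very rough
sketch of the shape I have in mind. Function names like uprobe_lookup(),
remove_uprobe() and free_uprobe() are simplified placeholders, not the
actual code in the patches:

/* reader side (breakpoint hit): look up and run handlers under RCU
 * Tasks Trace protection (linux/rcupdate_trace.h) */
static void handle_bp_sketch(struct inode *inode, loff_t offset,
                             struct pt_regs *regs)
{
        struct uprobe *uprobe;

        rcu_read_lock_trace();
        uprobe = uprobe_lookup(inode, offset);  /* uprobes_tree lookup */
        if (uprobe)
                handler_chain(uprobe, regs);
        rcu_read_unlock_trace();
}

/* writer side: batched unregister pays for a single grace period over
 * the whole batch, instead of once per uprobe */
static void uprobes_unregister_batch_sketch(struct uprobe **uprobes, int cnt)
{
        int i;

        for (i = 0; i < cnt; i++)
                remove_uprobe(uprobes[i]);  /* unlink from uprobes_tree */

        synchronize_rcu_tasks_trace();      /* wait for all tracing readers */

        for (i = 0; i < cnt; i++)
                free_uprobe(uprobes[i]);
}

The details in the patches differ (and uretprobes still need refcounting,
see point 4 below), but this is the gist: the reader side is as cheap as
RCU Tasks Trace can make it, and the one synchronize_rcu_tasks_trace() is
amortized across the whole batch.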
But if you do force me to drop either of those, please state clearly your
reasons/arguments. No one has yet pointed out why the refcounting is broken
or why the percpu RW semaphore is bad. On the contrary, Ingo Molnar did
suggest the percpu RW semaphore in the first place (see [0]), but we
postponed it due to the lack of batched APIs, and promised to do this work.
Here I am, doing the promised work. Not purely because of the percpu RW
semaphore, but benefiting from it just as well.

  [0] https://lore.kernel.org/linux-trace-kernel/Zf+d9twfyIDosINf@xxxxxxxxx/

4. Another tactical thing, but an important one: the refcounting scheme for
uprobes. I've replied already, but I think refcounting is unavoidable for
uretprobes, and the current refcounting scheme is problematic for batched
APIs due to the race between finding a uprobe and there still being a
possibility that we'd need to undo all of that and retry again. I think the
main thing is to agree to change the refcounting to avoid this race, allowing
for simpler batched registration. Hopefully we can agree on that.

But there is also refcount_inc_not_zero(), which is another limiting factor
for scalability (see above about the end goal of scalability), vs the
atomic64_add()-based epoch+refcount approach I took, which is noticeably
better on x86-64 and, at the very least, doesn't hurt any other architecture
(see the rough sketch at the very end of this email). I think the latter
could be generalized as an alternative flavor of refcount_t, but I'd prefer
to land it in uprobes in its current shape, and if we think it's a good idea
to generalize, we can always do that refactoring once things stabilize a bit.

You seem to have problems with the refcounting implementation I did (besides
overflow detection, which I'll address in the next revision, so not a
problem). My arguments are a) performance and b) it's well contained within
the get/put helpers and doesn't leak outside of them *at all*, while
providing a nice, always-successful get_uprobe() primitive. Can I please hear
the arguments for not doing it, besides "everyone is using
refcount_inc_not_zero()", which isn't much of a reason (we'd never do
anything novel in the kernel if that were a good enough reason not to do
something new).

Again, thanks for the engagement, I do appreciate it. But let's try to move
this forward. Thanks!
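P.S. To make the epoch+refcount point a bit more concrete, here is a very
rough sketch of the "get" side of the idea. Field and constant names are
made up for illustration, and the put side, where the epoch comes into play
to detect a get racing with the final put, is where the subtlety lives and
is omitted here:

/* one atomic64_t packs a 32-bit epoch (upper half) and a 32-bit
 * refcount (lower half); names are illustrative only */
#define UPROBE_REF_EPOCH        (1LL << 32)
#define UPROBE_REF_CNT          (1LL)

/* always succeeds: a single unconditional add, no cmpxchg loop */
static struct uprobe *get_uprobe_sketch(struct uprobe *uprobe)
{
        atomic64_add(UPROBE_REF_EPOCH + UPROBE_REF_CNT, &uprobe->ref64);
        return uprobe;
}

/* vs the conventional refcount_t pattern, which can fail and forces the
 * caller to handle "uprobe is going away, redo the lookup and retry" */
static struct uprobe *try_get_uprobe_sketch(struct uprobe *uprobe)
{
        return refcount_inc_not_zero(&uprobe->ref) ? uprobe : NULL;
}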