Re: [PATCH RFC v3 09/13] uprobes: SRCU-protect uretprobe lifetime (with timeout)

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Mon, 19 Aug 2024 13:34:57 -0700

On Mon, Aug 19, 2024 at 6:41 AM Oleg Nesterov <oleg@xxxxxxxxxx> wrote:
>
> On 08/12, Andrii Nakryiko wrote:
> >
> > Avoid taking refcount on uprobe in prepare_uretprobe(), instead take
> > uretprobe-specific SRCU lock and keep it active as kernel transfers
> > control back to user space.
> ...
> >  include/linux/uprobes.h |  49 ++++++-
> >  kernel/events/uprobes.c | 294 ++++++++++++++++++++++++++++++++++------
> >  2 files changed, 301 insertions(+), 42 deletions(-)
>
> Oh. To be honest I don't like this patch.
>
> I would like to know what other reviewers think, but to me it adds too many
> complications that I can't even fully understand...

Which parts? The atomic xchg() and cmpxchg() parts? What exactly do
you feel like you don't fully understand?

>
> And how much does it help performance-wise?

A lot, as we increase uprobe parallelism. Here's a subset of
benchmarks for 1-4, 8, 16, 32, 64, and 80 threads firing uretprobe.
With and without this SRCU change, but including all the other
changes, including the lockless VMA lookup. It's noticeable already
with just two competing CPUs/threads, and it just gets much worse from
there.

Of course in production you shouldn't come close to such rates of
uprobe/uretprobe firing, so this is definitely a microbenchmark
emphasizing the sharing between CPUs, but it still adds up. And we do
have production use cases that would like to fire uprobes at 100K+ per
second rates.

WITH SRCU for uretprobes
========================
uretprobe-nop         ( 1 cpus):    1.968 ± 0.001M/s  (  1.968M/s/cpu)
uretprobe-nop         ( 2 cpus):    3.739 ± 0.003M/s  (  1.869M/s/cpu)
uretprobe-nop         ( 3 cpus):    5.616 ± 0.003M/s  (  1.872M/s/cpu)
uretprobe-nop         ( 4 cpus):    7.286 ± 0.002M/s  (  1.822M/s/cpu)
uretprobe-nop         ( 8 cpus):   13.657 ± 0.007M/s  (  1.707M/s/cpu)
uretprobe-nop         (32 cpus):   45.305 ± 0.066M/s  (  1.416M/s/cpu)
uretprobe-nop         (64 cpus):   42.390 ± 0.922M/s  (  0.662M/s/cpu)
uretprobe-nop         (80 cpus):   47.554 ± 2.411M/s  (  0.594M/s/cpu)

WITHOUT SRCU for uretprobes
===========================
uretprobe-nop         ( 1 cpus):    2.197 ± 0.002M/s  (  2.197M/s/cpu)
uretprobe-nop         ( 2 cpus):    3.325 ± 0.001M/s  (  1.662M/s/cpu)
uretprobe-nop         ( 3 cpus):    4.129 ± 0.002M/s  (  1.376M/s/cpu)
uretprobe-nop         ( 4 cpus):    6.180 ± 0.003M/s  (  1.545M/s/cpu)
uretprobe-nop         ( 8 cpus):    7.323 ± 0.005M/s  (  0.915M/s/cpu)
uretprobe-nop         (16 cpus):    6.943 ± 0.005M/s  (  0.434M/s/cpu)
uretprobe-nop         (32 cpus):    5.931 ± 0.014M/s  (  0.185M/s/cpu)
uretprobe-nop         (64 cpus):    5.145 ± 0.003M/s  (  0.080M/s/cpu)
uretprobe-nop         (80 cpus):    4.925 ± 0.005M/s  (  0.062M/s/cpu)

>
> I'll try to take another look, and I'll try to think about other approaches,
> not that I have something better in mind...

Ok.

>
>
> But lets forgets this patch for the moment. The next one adds even more
> complications, and I think it doesn't make sense.
>

"Even more complications" is a bit of an overstatement. It just
applies everything we do for uretprobes in this patch to a very
straightforward single-stepped case.

> As I have already mentioned in the previous discussions, we can simply kill
> utask->active_uprobe. And utask->auprobe.

I don't have anything against that, in principle, but let's benchmark
and test that thoroughly. I'm a bit uneasy about the possibility that
some arch-specific code will do container_of() on this arch_uprobe in
order to get to uprobe, we'd need to audit all the code to make sure
that can't happen. Also it's a bit unfortunate that we have to assume
that struct arch_uprobe is small on all architectures, and there is no
code that assumes it can't be moved, etc, etc. (I also don't get why
you need memcpy

>
> So can't we start with the patch below? On top of your 08/13. It doesn't kill
> utask->auprobe yet, this needs a bit more trivial changes.
>
> What do you think?

I think that single-stepped case isn't the main use case (typically
uprobe/uretprobe will be installed on nop or push %rbp, both
emulated). uretprobes, though, are the main use case (along with
optimized entry uprobes). So what we do about single-stepped is a bit
secondary (for me, looking at production use cases).

But we do need to do something with uretprobes first and foremost.

>
> Oleg.
>
> -------------------------------------------------------------------------------
> From d7cb674eb6f7bb891408b2b6a5fb872a6c2f0f6c Mon Sep 17 00:00:00 2001
> From: Oleg Nesterov <oleg@xxxxxxxxxx>
> Date: Mon, 19 Aug 2024 15:34:55 +0200
> Subject: [RFC PATCH] uprobe: kill uprobe_task->active_uprobe
>
> Untested, not for inclusion yet, and I need to split it into 2 changes.
> It does 2 simple things:
>
>         1. active_uprobe != NULL is possible if and only if utask->state != 0,
>            so it turns the active_uprobe checks into the utask->state checks.
>
>         2. handle_singlestep() doesn't really need ->active_uprobe, it only
>            needs uprobe->arch which is "const" after prepare_uprobe().
>
>            So this patch adds the new "arch_uprobe uarch" member into utask
>            and changes pre_ssout() to do memcpy(&utask->uarch, &uprobe->arch).
> ---
>  include/linux/uprobes.h |  2 +-
>  kernel/events/uprobes.c | 37 +++++++++++--------------------------
>  2 files changed, 12 insertions(+), 27 deletions(-)

[...]