On Tue, Aug 6, 2024 at 12:37 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
>
> On Mon, Aug 05, 2024 at 01:28:03PM -0700, Andrii Nakryiko wrote:
> > trace_uprobe->nhit counter is not incremented atomically, so its value
> > is bogus in practice. On the other hand, it's actually a pretty big
> > uprobe scalability problem due to heavy cache line bouncing between CPUs
> > triggering the same uprobe.
>
> so you're seeing that in the benchmark, right? I'm curious how bad
> the numbers are
>

Yes. So, once we get rid of all the uprobe/uretprobe/mm locks (ongoing
work), this one was the last limiter to linear scalability. With this
counter I was topping out at about 12 mln/s of uprobe triggerings (I
think it was 32 CPUs, but I don't remember exactly now), and about 30%
of CPU cycles were spent in this increment. But those 30% don't paint
the full picture: once the counter is removed, the same uprobe
throughput jumps to 62 mln/s or so. So we definitely have to do
something about it.

> >
> > Drop it and emit an obviously unrealistic value in its stead in the
> > uprobe_profile seq file.
> >
> > The alternative would be allocating a per-CPU counter, but I'm not
> > sure it's justified.
> >
> > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> > ---
> >  kernel/trace/trace_uprobe.c | 4 +---
> >  1 file changed, 1 insertion(+), 3 deletions(-)
> >
> > diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> > index 52e76a73fa7c..5d38207db479 100644
> > --- a/kernel/trace/trace_uprobe.c
> > +++ b/kernel/trace/trace_uprobe.c
> > @@ -62,7 +62,6 @@ struct trace_uprobe {
> >  	struct uprobe *uprobe;
> >  	unsigned long offset;
> >  	unsigned long ref_ctr_offset;
> > -	unsigned long nhit;
> >  	struct trace_probe tp;
> >  };
> >
> > @@ -821,7 +820,7 @@ static int probes_profile_seq_show(struct seq_file *m, void *v)
> >
> >  	tu = to_trace_uprobe(ev);
> >  	seq_printf(m, "  %s %-44s %15lu\n", tu->filename,
> > -		   trace_probe_name(&tu->tp), tu->nhit);
> > +		   trace_probe_name(&tu->tp), ULONG_MAX);
>
> seems harsh.. would it be that bad to create per cpu counter for that?

Well, consider this patch a conversation starter. There are two reasons
why I'm removing the counter instead of doing a per-CPU one:

  - it's less work to send out a patch pointing out the problem (but
    the solution might change);
  - this counter was never correct in the presence of multiple threads,
    so I'm not sure how useful it is.

Yes, I think we can do per-CPU counters, but do we want to pay the
memory price? That's what I want to get from Masami, Steven, or Peter
(whoever cares enough). A rough, untested sketch of the per-CPU variant
is at the bottom of this mail, just to make the trade-off concrete.

>
> jirka
>
> >  	return 0;
> >  }
> >
> > @@ -1507,7 +1506,6 @@ static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs)
> >  	int ret = 0;
> >
> >  	tu = container_of(con, struct trace_uprobe, consumer);
> > -	tu->nhit++;
> >
> >  	udd.tu = tu;
> >  	udd.bp_addr = instruction_pointer(regs);
> > --
> > 2.43.5
> >
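
For reference, here is the rough, untested sketch of the per-CPU
variant mentioned above. It is not part of the posted patch; the nhits
field name and the trace_uprobe_nhit() helper are made up for
illustration. The idea is that the hot path only touches a CPU-local
cache line and the counters are folded only when someone reads
uprobe_profile; the price is one unsigned long per possible CPU per
trace_uprobe:

struct trace_uprobe {
	struct uprobe *uprobe;
	unsigned long offset;
	unsigned long ref_ctr_offset;
	unsigned long __percpu *nhits;	/* per-CPU hit counters */
	struct trace_probe tp;
};

/* in alloc_trace_uprobe(): allocate the per-CPU counters
 * (error path abbreviated) */
	tu->nhits = alloc_percpu(unsigned long);
	if (!tu->nhits)
		goto error;

/* in free_trace_uprobe(): */
	free_percpu(tu->nhits);

/* hot path, uprobe_dispatcher(): bump only this CPU's counter */
	this_cpu_inc(*tu->nhits);

/* slow path, probes_profile_seq_show(): sum the counters on read */
static unsigned long trace_uprobe_nhit(struct trace_uprobe *tu)
{
	unsigned long nhit = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		nhit += *per_cpu_ptr(tu->nhits, cpu);

	return nhit;
}

	seq_printf(m, "  %s %-44s %15lu\n", tu->filename,
		   trace_probe_name(&tu->tp), trace_uprobe_nhit(tu));

The increment itself stays, but it no longer bounces a shared cache
line between CPUs triggering the same uprobe, which is where the 30%
of cycles were going.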