On Tue, Aug 6, 2024 at 12:37 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
>
> On Mon, Aug 05, 2024 at 01:28:03PM -0700, Andrii Nakryiko wrote:
> > trace_uprobe->nhit counter is not incremented atomically, so its value
> > is bogus in practice. On the other hand, it's actually a pretty big
> > uprobe scalability problem due to heavy cache line bouncing between CPUs
> > triggering the same uprobe.
>
> so you're seeing that in the benchmark, right? I'm curious how bad
> the numbers are
>

Yes. So, once we get rid of all the uprobe/uretprobe/mm locks (ongoing
work), this one was the last limiter to linear scalability. With this
counter I was topping out at about 12 mln/s of uprobe triggerings (I
think it was 32 CPUs, but I don't remember exactly now), and about 30%
of CPU cycles were spent in this increment. But those 30% don't paint
the full picture: once the counter is removed, the same uprobe
throughput jumps to 62 mln/s or so. So we definitely have to do
something about it.

> >
> > Drop it and emit an obviously unrealistic value in its stead in the
> > uprobe_profile seq file.
> >
> > The alternative would be allocating a per-CPU counter, but I'm not
> > sure it's justified.
> >
> > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> > ---
> >  kernel/trace/trace_uprobe.c | 4 +---
> >  1 file changed, 1 insertion(+), 3 deletions(-)
> >
> > diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> > index 52e76a73fa7c..5d38207db479 100644
> > --- a/kernel/trace/trace_uprobe.c
> > +++ b/kernel/trace/trace_uprobe.c
> > @@ -62,7 +62,6 @@ struct trace_uprobe {
> >  	struct uprobe *uprobe;
> >  	unsigned long offset;
> >  	unsigned long ref_ctr_offset;
> > -	unsigned long nhit;
> >  	struct trace_probe tp;
> >  };
> >
> > @@ -821,7 +820,7 @@ static int probes_profile_seq_show(struct seq_file *m, void *v)
> >
> >  	tu = to_trace_uprobe(ev);
> >  	seq_printf(m, "  %s %-44s %15lu\n", tu->filename,
> > -		   trace_probe_name(&tu->tp), tu->nhit);
> > +		   trace_probe_name(&tu->tp), ULONG_MAX);
>
> seems harsh.. would it be that bad to create per cpu counter for that?

Well, consider this patch a conversation starter. There are two reasons
why I'm removing the counter instead of doing a per-CPU one:

  - it's less work to send out a patch pointing out the problem (but
    the solution might change);
  - this counter was never correct in the presence of multiple threads,
    so I'm not sure how useful it is.

Yes, I think we can do per-CPU counters, but do we want to pay the
memory price? That's what I want to get from Masami, Steven, or Peter
(whoever cares enough). A rough, untested sketch of the per-CPU variant
is at the bottom of this mail, just to make the trade-off concrete.

>
> jirka
>
> >  	return 0;
> >  }
> >
> > @@ -1507,7 +1506,6 @@ static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs)
> >  	int ret = 0;
> >
> >  	tu = container_of(con, struct trace_uprobe, consumer);
> > -	tu->nhit++;
> >
> >  	udd.tu = tu;
> >  	udd.bp_addr = instruction_pointer(regs);
> > --
> > 2.43.5
> >
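
For reference, here is the rough, untested sketch of the per-CPU
variant mentioned above. It is not part of the posted patch; the nhits
field name and the trace_uprobe_nhit() helper are made up for
illustration. The idea is that the hot path only touches a CPU-local
cache line and the counters are folded only when someone reads
uprobe_profile; the price is one unsigned long per possible CPU per
trace_uprobe:

struct trace_uprobe {
	struct uprobe *uprobe;
	unsigned long offset;
	unsigned long ref_ctr_offset;
	unsigned long __percpu *nhits;	/* per-CPU hit counters */
	struct trace_probe tp;
};

/* in alloc_trace_uprobe(): allocate the per-CPU counters
 * (error path abbreviated) */
	tu->nhits = alloc_percpu(unsigned long);
	if (!tu->nhits)
		goto error;

/* in free_trace_uprobe(): */
	free_percpu(tu->nhits);

/* hot path, uprobe_dispatcher(): bump only this CPU's counter */
	this_cpu_inc(*tu->nhits);

/* slow path, probes_profile_seq_show(): sum the counters on read */
static unsigned long trace_uprobe_nhit(struct trace_uprobe *tu)
{
	unsigned long nhit = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		nhit += *per_cpu_ptr(tu->nhits, cpu);

	return nhit;
}

	seq_printf(m, "  %s %-44s %15lu\n", tu->filename,
		   trace_probe_name(&tu->tp), trace_uprobe_nhit(tu));

The increment itself stays, but it no longer bounces a shared cache
line between CPUs triggering the same uprobe, which is where the 30%
of cycles were going.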