Re: [PATCH v2 bpf-next 1/2] selftests/bpf: scale benchmark counting by using per-CPU counters

On Fri, Mar 15, 2024 at 9:08 AM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Thu, Mar 14, 2024 at 10:18 PM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> >
> > When benchmarking with multiple threads (-pN, where N>1), we start
> > contending on a single atomic counter, both in the BPF trigger
> > benchmarks and in the "baseline" tests in user space (trig-base and
> > trig-uprobe-base benchmarks). As a result, we end up bottlenecking on
> > something completely irrelevant to the benchmark at hand.
> >
> > Scale counting up by using per-CPU counters on the BPF side. On the
> > user space side we do the next best thing: hash the thread ID to
> > approximate per-CPU behavior. It seems to work quite well in practice.
> >
> > To demonstrate the difference, I ran three benchmarks with 1, 2, 4, 8,
> > 16, and 32 threads:
> >   - trig-uprobe-base (no syscalls, pure tight counting loop in user-space);
> >   - trig-base (get_pgid() syscall, atomic counter in user-space);
> >   - trig-fentry (syscall to trigger fentry program, atomic uncontended per-CPU
> >     counter on BPF side).
> >
> > Command used:
> >
> >   for b in uprobe-base base fentry; do \
> >     for p in 1 2 4 8 16 32; do \
> >       printf "%-11s %2d: %s\n" $b $p \
> >         "$(sudo ./bench -w2 -d5 -a -p$p trig-$b | tail -n1 | cut -d'(' -f1 | cut -d' ' -f3-)"; \
> >     done; \
> >   done
> >
> > Before these changes, aggregate throughput across all threads doesn't
> > scale well with the number of threads; it actually falls sharply for
> > uprobe-base due to very high contention:
> >
> >   uprobe-base  1:  138.998 ± 0.650M/s
> >   uprobe-base  2:   70.526 ± 1.147M/s
> >   uprobe-base  4:   63.114 ± 0.302M/s
> >   uprobe-base  8:   54.177 ± 0.138M/s
> >   uprobe-base 16:   45.439 ± 0.057M/s
> >   uprobe-base 32:   37.163 ± 0.242M/s
> >   base         1:   16.940 ± 0.182M/s
> >   base         2:   19.231 ± 0.105M/s
> >   base         4:   21.479 ± 0.038M/s
> >   base         8:   23.030 ± 0.037M/s
> >   base        16:   22.034 ± 0.004M/s
> >   base        32:   18.152 ± 0.013M/s
> >   fentry       1:   14.794 ± 0.054M/s
> >   fentry       2:   17.341 ± 0.055M/s
> >   fentry       4:   23.792 ± 0.024M/s
> >   fentry       8:   21.557 ± 0.047M/s
> >   fentry      16:   21.121 ± 0.004M/s
> >   fentry      32:   17.067 ± 0.023M/s
> >
> > After these changes, we see almost perfect linear scaling, as expected.
> > The sub-linear scaling when going from 8 to 16 threads is interesting
> > and consistent on my test machine, but I haven't investigated what is
> > causing this peculiar slowdown (it shows up across all benchmarks and
> > could be due to hyperthreading effects, I'm not sure).
> >
> >   uprobe-base  1:  139.980 ± 0.648M/s
> >   uprobe-base  2:  270.244 ± 0.379M/s
> >   uprobe-base  4:  532.044 ± 1.519M/s
> >   uprobe-base  8: 1004.571 ± 3.174M/s
> >   uprobe-base 16: 1720.098 ± 0.744M/s
> >   uprobe-base 32: 3506.659 ± 8.549M/s
> >   base         1:   16.869 ± 0.071M/s
> >   base         2:   33.007 ± 0.092M/s
> >   base         4:   64.670 ± 0.203M/s
> >   base         8:  121.969 ± 0.210M/s
> >   base        16:  207.832 ± 0.112M/s
> >   base        32:  424.227 ± 1.477M/s
> >   fentry       1:   14.777 ± 0.087M/s
> >   fentry       2:   28.575 ± 0.146M/s
> >   fentry       4:   56.234 ± 0.176M/s
> >   fentry       8:  106.095 ± 0.385M/s
> >   fentry      16:  181.440 ± 0.032M/s
> >   fentry      32:  369.131 ± 0.693M/s
> >
> > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> > ---
> >  .../selftests/bpf/benchs/bench_trigger.c      | 40 ++++++++++++++++---
> >  .../selftests/bpf/progs/trigger_bench.c       | 39 ++++++++++++------
> >  2 files changed, 62 insertions(+), 17 deletions(-)
> >
> > diff --git a/tools/testing/selftests/bpf/benchs/bench_trigger.c b/tools/testing/selftests/bpf/benchs/bench_trigger.c
> > index ace0d1011a8e..8fbc78d5f8a4 100644
> > --- a/tools/testing/selftests/bpf/benchs/bench_trigger.c
> > +++ b/tools/testing/selftests/bpf/benchs/bench_trigger.c
> > @@ -1,15 +1,45 @@
> >  // SPDX-License-Identifier: GPL-2.0
> >  /* Copyright (c) 2020 Facebook */
> > +#define _GNU_SOURCE
> > +#include <unistd.h>
> >  #include "bench.h"
> >  #include "trigger_bench.skel.h"
> >  #include "trace_helpers.h"
> >
> > +/* adjust slot shift in inc_counter() if changing */
> > +#define MAX_BUCKETS 256
> > +
> >  /* BPF triggering benchmarks */
> >  static struct trigger_ctx {
> >         struct trigger_bench *skel;
> >  } ctx;
> >
> > -static struct counter base_hits;
> > +static struct counter base_hits[MAX_BUCKETS];
> > +
> > +static __always_inline void inc_counter(struct counter *counters)
> > +{
> > +       static __thread int tid = 0;
> > +       unsigned slot;
> > +
> > +       if (unlikely(tid == 0))
> > +               tid = gettid();
> > +
> > +       /* multiplicative hashing, it's fast */
> > +       slot = 2654435769U * tid;
> > +       slot >>= 24; /* use the highest byte as an index */
> > +
> > +       atomic_inc(&counters[slot].value);
> > +}
> > +
> > +static __always_inline long sum_and_reset_counters(struct counter *counters)
>
> Why __always_inline ?
> It's a slow path. Inlining or not shouldn't make any difference.

It was a copy/paste of the inc_counter() signature originally; you are
right, it's not necessary there.
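
For context (not part of the diff): sum_and_reset_counters() is only
meant to run from the benchmark's periodic measure() callback, once per
sampling interval, so it is firmly on the slow path. A hypothetical
sketch of that consumer, assuming bench.h's struct bench_res and its
hits field, would look like:

  /* illustrative only, not an actual hunk from this patch: the measure()
   * callback drains all buckets, so each sample reports the hits
   * accumulated since the previous sample
   */
  static void trigger_base_measure(struct bench_res *res)
  {
          res->hits = sum_and_reset_counters(base_hits);
  }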

>
> > +{
> > +       int i;
> > +       long sum = 0;
> > +
> > +       for (i = 0; i < MAX_BUCKETS; i++)
> > +               sum += atomic_swap(&counters[i].value, 0);
> > +       return sum;
> > +}
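
The trigger_bench.c (BPF side) hunk isn't quoted above. Roughly, the
"atomic uncontended per-CPU counter on BPF side" from the commit message
can be sketched as below; the array size, the CPU masking, and the attach
point are illustrative assumptions, not the actual patch contents:

  // SPDX-License-Identifier: GPL-2.0
  /* illustrative sketch of per-CPU counting on the BPF side */
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include "bpf_misc.h" /* selftests helper header, for SYS_PREFIX */

  #define MAX_CPUS 256 /* assumed to match MAX_BUCKETS on the user space side */

  /* one cache-line-padded slot per possible CPU to avoid false sharing */
  struct counter {
          long value;
  } __attribute__((aligned(128)));

  struct counter hits[MAX_CPUS];

  static __always_inline void inc_counter(void)
  {
          unsigned int cpu = bpf_get_smp_processor_id();

          /* each CPU bumps its own slot; no cross-CPU contention */
          __sync_add_and_fetch(&hits[cpu & (MAX_CPUS - 1)].value, 1);
  }

  SEC("fentry/" SYS_PREFIX "sys_getpgid")
  int bench_trigger_fentry(void *ctx)
  {
          inc_counter();
          return 0;
  }

  char _license[] SEC("license") = "GPL";

User space would then drain hits[] across all slots the same way
sum_and_reset_counters() drains base_hits above.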




