On Thu, Mar 14, 2024 at 10:18 PM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
>
> When benchmarking with multiple threads (-pN, where N>1), we start
> contending on a single atomic counter that both BPF trigger benchmarks
> are using, as well as the "baseline" tests in user space (trig-base and
> trig-uprobe-base benchmarks). As such, we start bottlenecking on
> something completely irrelevant to the benchmark at hand.
>
> Scale counting up by using per-CPU counters on the BPF side. On the
> user space side we do the next best thing: hash the thread ID to
> approximate per-CPU behavior. It seems to work quite well in practice.
>
> To demonstrate the difference, I ran three benchmarks with 1, 2, 4, 8,
> 16, and 32 threads:
> - trig-uprobe-base (no syscalls, pure tight counting loop in user-space);
> - trig-base (getpgid() syscall, atomic counter in user-space);
> - trig-fentry (syscall to trigger an fentry program, atomic uncontended
>   per-CPU counter on the BPF side).
>
> Command used:
>
>     for b in uprobe-base base fentry; do \
>         for p in 1 2 4 8 16 32; do \
>             printf "%-11s %2d: %s\n" $b $p \
>                 "$(sudo ./bench -w2 -d5 -a -p$p trig-$b | tail -n1 | cut -d'(' -f1 | cut -d' ' -f3-)"; \
>         done; \
>     done
>
> Before these changes, aggregate throughput across all threads doesn't
> scale well with the number of threads; it actually falls sharply for
> uprobe-base due to very high contention:
>
> uprobe-base  1: 138.998 ± 0.650M/s
> uprobe-base  2: 70.526 ± 1.147M/s
> uprobe-base  4: 63.114 ± 0.302M/s
> uprobe-base  8: 54.177 ± 0.138M/s
> uprobe-base 16: 45.439 ± 0.057M/s
> uprobe-base 32: 37.163 ± 0.242M/s
> base         1: 16.940 ± 0.182M/s
> base         2: 19.231 ± 0.105M/s
> base         4: 21.479 ± 0.038M/s
> base         8: 23.030 ± 0.037M/s
> base        16: 22.034 ± 0.004M/s
> base        32: 18.152 ± 0.013M/s
> fentry       1: 14.794 ± 0.054M/s
> fentry       2: 17.341 ± 0.055M/s
> fentry       4: 23.792 ± 0.024M/s
> fentry       8: 21.557 ± 0.047M/s
> fentry      16: 21.121 ± 0.004M/s
> fentry      32: 17.067 ± 0.023M/s
>
> After these changes, we see almost perfect linear scaling, as expected.
> The sub-linear scaling when going from 8 to 16 threads is interesting
> and consistent on my test machine, but I haven't investigated what is
> causing this peculiar slowdown (it shows up across all benchmarks;
> could be due to hyperthreading effects, not sure).
>
> uprobe-base  1: 139.980 ± 0.648M/s
> uprobe-base  2: 270.244 ± 0.379M/s
> uprobe-base  4: 532.044 ± 1.519M/s
> uprobe-base  8: 1004.571 ± 3.174M/s
> uprobe-base 16: 1720.098 ± 0.744M/s
> uprobe-base 32: 3506.659 ± 8.549M/s
> base         1: 16.869 ± 0.071M/s
> base         2: 33.007 ± 0.092M/s
> base         4: 64.670 ± 0.203M/s
> base         8: 121.969 ± 0.210M/s
> base        16: 207.832 ± 0.112M/s
> base        32: 424.227 ± 1.477M/s
> fentry       1: 14.777 ± 0.087M/s
> fentry       2: 28.575 ± 0.146M/s
> fentry       4: 56.234 ± 0.176M/s
> fentry       8: 106.095 ± 0.385M/s
> fentry      16: 181.440 ± 0.032M/s
> fentry      32: 369.131 ± 0.693M/s
>
> Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> ---
>  .../selftests/bpf/benchs/bench_trigger.c      | 40 ++++++++++++++++---
>  .../selftests/bpf/progs/trigger_bench.c       | 39 ++++++++++++------
>  2 files changed, 62 insertions(+), 17 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/benchs/bench_trigger.c b/tools/testing/selftests/bpf/benchs/bench_trigger.c
> index ace0d1011a8e..8fbc78d5f8a4 100644
> --- a/tools/testing/selftests/bpf/benchs/bench_trigger.c
> +++ b/tools/testing/selftests/bpf/benchs/bench_trigger.c
> @@ -1,15 +1,45 @@
>  // SPDX-License-Identifier: GPL-2.0
>  /* Copyright (c) 2020 Facebook */
> +#define _GNU_SOURCE
> +#include <unistd.h>
>  #include "bench.h"
>  #include "trigger_bench.skel.h"
>  #include "trace_helpers.h"
>
> +/* adjust slot shift in inc_counter() if changing */
> +#define MAX_BUCKETS 256
> +
>  /* BPF triggering benchmarks */
>  static struct trigger_ctx {
> 	struct trigger_bench *skel;
>  } ctx;
>
> -static struct counter base_hits;
> +static struct counter base_hits[MAX_BUCKETS];
> +
> +static __always_inline void inc_counter(struct counter *counters)
> +{
> +	static __thread int tid = 0;
> +	unsigned slot;
> +
> +	if (unlikely(tid == 0))
> +		tid = gettid();
> +
> +	/* multiplicative hashing, it's fast */
> +	slot = 2654435769U * tid;
> +	slot >>= 24; /* use the highest byte as the bucket index */
> +
> +	atomic_inc(&counters[slot].value);
> +}
> +
> +static __always_inline long sum_and_reset_counters(struct counter *counters)

Why __always_inline?
It's a slow path. Inlining or not shouldn't make any difference.

> +{
> +	int i;
> +	long sum = 0;
> +
> +	for (i = 0; i < MAX_BUCKETS; i++)
> +		sum += atomic_swap(&counters[i].value, 0);
> +	return sum;
> +}
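
As an aside for readers of the thread: the progs/trigger_bench.c half of the
diff (the "per-CPU counters on the BPF side" part of the commit message) is
not quoted above. A rough sketch of that idea, purely for illustration and
not the actual hunk (the hits[] array, CPU_MASK, the aligned(128) layout and
the tp/syscalls/sys_enter_getpgid section name are all assumptions here),
would be something like:

// SPDX-License-Identifier: GPL-2.0
/* Illustrative sketch only, not the patch's trigger_bench.c hunk:
 * each CPU bumps its own cache-line-aligned slot, so triggered
 * programs never contend on a single shared atomic counter.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define CPU_MASK 255 /* assumed to mirror MAX_BUCKETS on the user-space side */

struct counter {
	long value;
} __attribute__((aligned(128))); /* avoid false sharing between slots */

/* global array, readable from user space through the skeleton */
struct counter hits[CPU_MASK + 1];

static __always_inline void inc_counter(void)
{
	int cpu = bpf_get_smp_processor_id();

	/* masking keeps the index provably in bounds for the verifier */
	__sync_fetch_and_add(&hits[cpu & CPU_MASK].value, 1);
}

SEC("tp/syscalls/sys_enter_getpgid")
int bench_trigger_tp(void *ctx)
{
	inc_counter();
	return 0;
}

char _license[] SEC("license") = "GPL";

The user-space side then only has to sum (and reset) the per-slot values when
reporting, which is what sum_and_reset_counters() above does for its buckets.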