Re: [PATCH v2 bpf-next 2/2] selftests/bpf: add fast mostly in-kernel BPF triggering benchmarks

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Fri, 15 Mar 2024 11:47:58 -0700

On Fri, Mar 15, 2024 at 9:59 AM Andrii Nakryiko
<andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Fri, Mar 15, 2024 at 9:31 AM Andrii Nakryiko
> <andrii.nakryiko@xxxxxxxxx> wrote:
> >
> > On Fri, Mar 15, 2024 at 9:03 AM Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> > >
> > > On Thu, Mar 14, 2024 at 10:18 PM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> > > >
> > > > Existing kprobe/fentry triggering benchmarks have 1-to-1 mapping between
> > > > one syscall execution and BPF program run. While we use a fast
> > > > get_pgid() syscall, syscall overhead can still be non-trivial.
> > > >
> > > > This patch adds kprobe/fentry set of benchmarks significantly amortizing
> > > > the cost of syscall vs actual BPF triggering overhead. We do this by
> > > > employing BPF_PROG_TEST_RUN command to trigger "driver" raw_tp program
> > > > which does a tight parameterized loop calling cheap BPF helper
> > > > (bpf_get_smp_processor_id()), to which kprobe/fentry programs are
> > > > attached for benchmarking.
> > > >
> > > > This way 1 bpf() syscall causes N executions of BPF program being
> > > > benchmarked. N defaults to 100, but can be adjusted with
> > > > --trig-batch-iters CLI argument.
> > > >
> > > > Results speak for themselves:
> > > >
> > > > $ ./run_bench_trigger.sh
> > > > uprobe-base         :  138.054 ± 0.556M/s
> > > > base                :   16.650 ± 0.123M/s
> > >
> > > What's going on here? Why two bases are so different?
> > > I thought the "base" is what all other benchmarks
> > > should be compared against.
> > > The "base" is the theoretical maximum for all benchs.
> > > Or uprobe* benches should be compared with uprobe-base
> > > while all other benchs compared with 'base' ?
> > > Probably not anymore due to this new approach.
> > > The 'base' is kinda lost its value then.
> >
> > naming is hard. base is doing syscall(get_pgid) in a tight loop. It's
> > base compared to previous trigger benchmarks where we used syscall to
> > trigger kprobe/fentry programs. uprobe-base is just a user-space loop
> > that does atomic_inc() in a tight loop. So uprobe-base is basically
> > the measure of how fast CPU is, but it's unrealistic to expect either
> > fentry/kprobe to get close, and especially it's unrealistic to expect
> > uprobes to get close to it.
> >
> > Naming suggestions are welcome, though.
> >
> > I'm not sure what the "base" should be for xxx-fast benchmarks? Doing
> > a counter loop in driver BPF program, perhaps? Would you like me to
> > add base-fast benchmark doing just that?
> >
>
> How about this.
>
> base -> base-syscall (i.e., "syscall-calling baseline")
> uprobe-base -> base-user-loop (i.e., "user space-only baseline")
> and then for "fast" baseline we add "base-kernel-loop" for
> "kernel-side looping baseline"

I think the "base" part doesn't fit in either of them.
Maybe:
base -> syscall_loop
uprobe-base -> user_space_loop

since the first is only doing syscall in a loop
and the 2nd doesn't even go to the kernel.

>
> Or we can use some naming based on "counting": base-syscall-count,
> base-user-count, base-kernel-count?

Instead of fentry-fast, fentry-batch
sounds more accurate.

But in general it doesn't feel right to keep the existing
benchs, since we've discovered that syscall overhead
affects the measurement so much.
I think we should bite the bullet and use bpf_testmod in the bench tool.
Then we can add a tracepoint there and call it in a proper kernel loop,
then add an fmodret-able empty function and call it in another loop.

Then the bench tool can trigger them via bpf->kfunc->that_looping_helper
and capture, as accurately as possible, the overhead of
tp, raw_tp, kprobe, fentry, fmodret.
And all of them will be comparable to each other.
Right now fmodret vs fentry are comparable, but both include
syscall time, which makes the comparison questionable.
Especially consider kernels with mitigations enabled:
syscall overhead is even higher there and the relative delta is smaller,
to the point that fentry vs fexit might measure equal.
A benchmarking tool should be a gold standard and accurate tool
for performance measurement.
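
To put rough numbers on the syscall point, here is a minimal sketch of the
arithmetic only. measured_rel_delta() and its parameters are hypothetical
names, and any values fed to it are made up rather than measured. It assumes
both programs are timed together with the same fixed per-trigger overhead.

```c
#include <assert.h>

/*
 * Relative difference between two measured per-trigger costs when each
 * measurement also includes the same fixed overhead (for example syscall
 * entry/exit). All inputs are in nanoseconds. Hypothetical helper, for
 * illustration only.
 */
static double measured_rel_delta(double cost_a_ns, double cost_b_ns,
                                 double fixed_overhead_ns)
{
        double a = cost_a_ns + fixed_overhead_ns;
        double b = cost_b_ns + fixed_overhead_ns;

        assert(a > 0.0);
        return (b - a) / a;
}
```

With 10ns vs 12ns of program cost the real delta is 20%. Add a hypothetical
190ns of syscall time to both and the measured delta drops to 1%, which is
easily lost in noise.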

The trick:
+       for (i = 0; i < batch_iters; i++)
+               (void)bpf_get_smp_processor_id(); /* attach here to benchmark */

is neat, but since it works for fentry/kprobe only
we need testmod to generalize it for fmodret and tracepoints.
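
The amortization behind that loop can be sketched in plain C. This models
the arithmetic only, not the actual benchmark code: per_trigger_ns() is a
hypothetical helper and the costs passed to it are assumed, not measured.
One syscall is spread over batch_iters triggers.

```c
#include <assert.h>

/*
 * Average cost attributed to one trigger when a single syscall drives
 * batch_iters triggers. Inputs are in nanoseconds. Hypothetical helper,
 * for illustration only.
 */
static double per_trigger_ns(double syscall_ns, double trigger_ns,
                             unsigned int batch_iters)
{
        assert(batch_iters > 0);
        return (syscall_ns + trigger_ns * batch_iters) / batch_iters;
}
```

Assuming a 200ns syscall and a 10ns trigger, batch_iters = 1 gives 210ns per
trigger, while batch_iters = 100 gives 12ns, so the residual syscall share
shrinks from 200ns to 2ns per trigger.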