Re: [PATCH v2 bpf-next 2/2] selftests/bpf: add fast mostly in-kernel BPF triggering benchmarks

On Fri, Mar 15, 2024 at 11:48 AM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Fri, Mar 15, 2024 at 9:59 AM Andrii Nakryiko
> <andrii.nakryiko@xxxxxxxxx> wrote:
> >
> > On Fri, Mar 15, 2024 at 9:31 AM Andrii Nakryiko
> > <andrii.nakryiko@xxxxxxxxx> wrote:
> > >
> > > On Fri, Mar 15, 2024 at 9:03 AM Alexei Starovoitov
> > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > >
> > > > On Thu, Mar 14, 2024 at 10:18 PM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> > > > >
> > > > > Existing kprobe/fentry triggering benchmarks have 1-to-1 mapping between
> > > > > one syscall execution and BPF program run. While we use a fast
> > > > > get_pgid() syscall, syscall overhead can still be non-trivial.
> > > > >
> > > > > This patch adds kprobe/fentry set of benchmarks significantly amortizing
> > > > > the cost of syscall vs actual BPF triggering overhead. We do this by
> > > > > employing BPF_PROG_TEST_RUN command to trigger "driver" raw_tp program
> > > > > which does a tight parameterized loop calling cheap BPF helper
> > > > > (bpf_get_smp_processor_id()), to which kprobe/fentry programs are
> > > > > attached for benchmarking.
> > > > >
> > > > > This way 1 bpf() syscall causes N executions of BPF program being
> > > > > benchmarked. N defaults to 100, but can be adjusted with
> > > > > --trig-batch-iters CLI argument.
> > > > >
> > > > > Results speak for themselves:
> > > > >
> > > > > $ ./run_bench_trigger.sh
> > > > > uprobe-base         :  138.054 ± 0.556M/s
> > > > > base                :   16.650 ± 0.123M/s
> > > >
> > > > What's going on here? Why are the two bases so different?
> > > > I thought the "base" is what all other benchmarks
> > > > should be compared against.
> > > > The "base" is the theoretical maximum for all benchs.
> > > > Or should uprobe* benches be compared with uprobe-base
> > > > while all other benchs are compared with 'base'?
> > > > Probably not anymore due to this new approach.
> > > > The 'base' has kinda lost its value then.
> > >
> > > naming is hard. base is doing syscall(get_pgid) in a tight loop.
> > > It's the base compared to previous trigger benchmarks where we used
> > > a syscall to trigger kprobe/fentry programs. uprobe-base is just a
> > > user-space tight loop that does atomic_inc(). So uprobe-base is
> > > basically the measure of how fast the CPU is, but it's unrealistic
> > > to expect either fentry/kprobe to get close to it, and especially
> > > unrealistic to expect uprobes to.
> > >
> > > Naming suggestions are welcome, though.
> > >
> > > I'm not sure what the "base" should be for xxx-fast benchmarks? Doing
> > > a counter loop in driver BPF program, perhaps? Would you like me to
> > > add base-fast benchmark doing just that?
> > >
> >
> > How about this.
> >
> > base -> base-syscall (i.e., "syscall-calling baseline")
> > uprobe-base -> base-user-loop (i.e., "user space-only baseline")
> > and then for "fast" baseline we add "base-kernel-loop" for
> > "kernel-side looping baseline"
>
> I think "base" part in both doesn't fit.
> Maybe
> base -> syscall_loop
> uprobe-base - user_space_loop

ok

>
> since the first is only doing syscall in a loop
> and the 2nd doesn't even go to the kernel.
>
> >
> > Or we can use some naming based on "counting": base-syscall-count,
> > base-user-count, base-kernel-count?
>
> Instead of fentry-fast -> fentry-batch
> sounds more accurate.
>

ok, -batch it is

> But in general it doesn't feel right to keep the existing
> benchs, since we've discovered that syscall overhead
> affects the measurement so much.
> I think we should bite the bullet and use bpf_testmod in bench tool.

This will worsen the logistics of using it for any benchmark so much
that it will be pretty much a useless tool.

I can try to find some other way to trigger tracepoint and fmod_ret
without relying on custom modules, but I'd rather just remove those
benchmarks altogether than add dependency on bpf_testmod.

> Then we can add a tracepoint there and call it in proper kernel loop,
> then add fmodret-able empty function and call it in another loop.
>
> Then bench tool can trigger them via bpf->kfunc->that_looping_helper
> and capture as accurate as possibly overhead of
> tp, raw_tp, kprobe, fentry, fmodret.
> And all of them will be comparable to each other.
> Right now fmodret vs fentry are comparable, but both have
> syscall time addon which makes the comparison questionable.
> Especially consider the kernels with mitigations.
> Overhead of syscall is even higher and the delta is smaller,
> to the point that fentry vs fexit might measure equal.

It's still useful to be able to do this, even if syscall overhead is
very measurable, because when profiling it's easy to filter out the
syscall overhead and compare purely fentry vs fexit overheads.

Even with mitigations on, the differences between fentry and fexit are
measurable. And bench is a nice tool to generate non-stop work so that
profiling data can be captured conveniently. A custom test module would
make that much harder.


> A benchmarking tool should be a gold standard and accurate tool
> for performance measurement.
>

This was neither a goal, nor do I sign up for that aspiration :) It's
first and foremost a small and useful tool for doing local performance
optimization work when it comes to BPF-related functionality in the
kernel. And its portability is a huge part of this.

> The trick:
> +       for (i = 0; i < batch_iters; i++)
> +               (void)bpf_get_smp_processor_id(); /* attach here to benchmark */
>
> is neat, but since it works for fentry/kprobe only
> we need testmod to generalize it for fmodret and tracepoints.

See above; I agree it would be nice to have (and I'll look for other
ways to achieve this), but it's not really a required property for the
benchmark to be useful.

For now, I'll drop this patch and re-submit the first one with
__always_inline removed just to get those prerequisites landed.




