Re: [PATCH v2 bpf-next 2/2] selftests/bpf: add fast mostly in-kernel BPF triggering benchmarks

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Fri, 15 Mar 2024 09:03:12 -0700

On Thu, Mar 14, 2024 at 10:18 PM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
>
> Existing kprobe/fentry triggering benchmarks have 1-to-1 mapping between
> one syscall execution and BPF program run. While we use a fast
> get_pgid() syscall, syscall overhead can still be non-trivial.
>
> This patch adds kprobe/fentry set of benchmarks significantly amortizing
> the cost of syscall vs actual BPF triggering overhead. We do this by
> employing BPF_PROG_TEST_RUN command to trigger "driver" raw_tp program
> which does a tight parameterized loop calling cheap BPF helper
> (bpf_get_smp_processor_id()), to which kprobe/fentry programs are
> attached for benchmarking.
>
> This way 1 bpf() syscall causes N executions of BPF program being
> benchmarked. N defaults to 100, but can be adjusted with
> --trig-batch-iters CLI argument.
>
> Results speak for themselves:
>
> $ ./run_bench_trigger.sh
> uprobe-base         :  138.054 ± 0.556M/s
> base                :   16.650 ± 0.123M/s

What's going on here? Why are the two bases so different?
I thought "base" is what all the other benchmarks
should be compared against.
"base" is the theoretical maximum for all benchmarks.
Or should the uprobe* benches be compared with uprobe-base
while all the other benches are compared with 'base'?
Probably not anymore, given this new approach.
'base' has kind of lost its value then.

> tp                  :   11.068 ± 0.100M/s
> rawtp               :   14.087 ± 0.511M/s
> kprobe              :    9.641 ± 0.027M/s
> kprobe-multi        :   10.263 ± 0.061M/s
> kretprobe           :    5.475 ± 0.028M/s
> kretprobe-multi     :    5.703 ± 0.036M/s
> fentry              :   14.544 ± 0.112M/s
> fexit               :   10.637 ± 0.073M/s
> fmodret             :   11.357 ± 0.061M/s
> kprobe-fast         :   14.286 ± 0.377M/s
> kprobe-multi-fast   :   14.999 ± 0.204M/s
> kretprobe-fast      :    7.646 ± 0.084M/s
> kretprobe-multi-fast:    4.354 ± 0.066M/s
> fentry-fast         :   31.475 ± 0.254M/s
> fexit-fast          :   17.379 ± 0.195M/s

I think the "-fast" suffix doesn't really fit here.
It's a different way to benchmark fexit vs kprobe overhead.

I think the old approach should be replaced with the new one.
There is no value in keeping the old stuff around now that
it's clear it measures much more than it should.

> One huge and not yet explained deviation is a slowdown of
> kretprobe-multi, we should look into that separately.
>
> kretprobe           :    5.475 ± 0.028M/s
> kretprobe-multi     :    5.703 ± 0.036M/s
> kretprobe-fast      :    7.646 ± 0.084M/s
> kretprobe-multi-fast:    4.354 ± 0.066M/s

That is weird indeed. Since it's a single-CPU run,
it cannot be due to contention.
My gut feeling is that something odd is going on here:
#ifdef CONFIG_X86_KERNEL_IBT
static unsigned long get_entry_ip(unsigned long fentry_ip)

Did you have it in your config?

> +SEC("raw_tp")
> +int trigger_driver(void *ctx)
> +{
> +       int i;
> +
> +       for (i = 0; i < batch_iters; i++)
> +               (void)bpf_get_smp_processor_id(); /* attach here to benchmark */

bpf_get_numa_node_id() is probably even faster and
not subject to DEBUG configs.