On Fri, Mar 15, 2024 at 9:31 AM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Fri, Mar 15, 2024 at 9:03 AM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Thu, Mar 14, 2024 at 10:18 PM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> > >
> > > Existing kprobe/fentry triggering benchmarks have a 1-to-1 mapping between
> > > one syscall execution and one BPF program run. While we use a fast
> > > get_pgid() syscall, syscall overhead can still be non-trivial.
> > >
> > > This patch adds a set of kprobe/fentry benchmarks that significantly
> > > amortizes the cost of the syscall vs the actual BPF triggering overhead.
> > > We do this by employing the BPF_PROG_TEST_RUN command to trigger a
> > > "driver" raw_tp program which does a tight parameterized loop calling a
> > > cheap BPF helper (bpf_get_smp_processor_id()), to which kprobe/fentry
> > > programs are attached for benchmarking.
> > >
> > > This way 1 bpf() syscall causes N executions of the BPF program being
> > > benchmarked. N defaults to 100, but can be adjusted with the
> > > --trig-batch-iters CLI argument.
> > >
> > > Results speak for themselves:
> > >
> > > $ ./run_bench_trigger.sh
> > > uprobe-base : 138.054 ± 0.556M/s
> > > base        :  16.650 ± 0.123M/s
> >
> > What's going on here? Why are the two bases so different?
> > I thought "base" is what all other benchmarks
> > should be compared against,
> > i.e., "base" is the theoretical maximum for all benchs.
> > Or should uprobe* benches be compared with uprobe-base
> > while all other benchs are compared with 'base'?
> > Probably not anymore due to this new approach.
> > 'base' has kinda lost its value then.
>
> naming is hard. base is doing syscall(get_pgid) in a tight loop. It's
> the baseline for the previous trigger benchmarks, where we used a syscall
> to trigger kprobe/fentry programs. uprobe-base is just a user-space loop
> that does atomic_inc() in a tight loop.
> So uprobe-base is basically
> a measure of how fast the CPU is, but it's unrealistic to expect either
> fentry or kprobe to get close to it, and it's especially unrealistic to
> expect uprobes to get close to it.
>
> Naming suggestions are welcome, though.
>
> I'm not sure what the "base" should be for the xxx-fast benchmarks.
> Doing a counter loop in the driver BPF program, perhaps? Would you like
> me to add a base-fast benchmark doing just that?

How about this:

  base -> base-syscall (i.e., "syscall-calling baseline")
  uprobe-base -> base-user-loop (i.e., "user space-only baseline")

and then for the "fast" baseline we add "base-kernel-loop" as a
"kernel-side looping baseline".

Or we can use some naming based on "counting": base-syscall-count,
base-user-count, base-kernel-count?

Any preferences, anyone?

> > > tp                  : 11.068 ± 0.100M/s
> > > rawtp               : 14.087 ± 0.511M/s
> > > kprobe              :  9.641 ± 0.027M/s
> > > kprobe-multi        : 10.263 ± 0.061M/s
> > > kretprobe           :  5.475 ± 0.028M/s
> > > kretprobe-multi     :  5.703 ± 0.036M/s
> > > fentry              : 14.544 ± 0.112M/s
> > > fexit               : 10.637 ± 0.073M/s
> > > fmodret             : 11.357 ± 0.061M/s
> > > kprobe-fast         : 14.286 ± 0.377M/s
> > > kprobe-multi-fast   : 14.999 ± 0.204M/s
> > > kretprobe-fast      :  7.646 ± 0.084M/s
> > > kretprobe-multi-fast:  4.354 ± 0.066M/s
> > > fentry-fast         : 31.475 ± 0.254M/s
> > > fexit-fast          : 17.379 ± 0.195M/s
> >
> > I think the "-fast" suffix doesn't really fit here.
> > It's a different way to benchmark fexit vs kprobe overhead.
> >
> > I think the old approach should be replaced with the new one.
> > There is no value in keeping the old stuff around now that
> > it's clear it's measuring much more than it should.
>
> I'd say we should keep them. Note that fmod_ret, tp, and raw_tp can't
> be benchmarked the same way, so we don't have -fast variants for them.
> If we get rid of the pre-existing fentry/kprobe benchmarks, we won't be
> able to compare raw_tp/fmod_ret/fexit.s to them.
>
> And to be clear, in practice I'm not comparing xxx-fast with xxx at
> all. I just put all the numbers in this patch summary to show that we
> indeed lose significant performance to the syscall.
>
> In practice, when comparing kprobe vs fentry, it should be kprobe-fast
> vs fentry-fast.
>
> > > One huge and not yet explained deviation is a slowdown of
> > > kretprobe-multi; we should look into that separately.
> > >
> > > kretprobe           : 5.475 ± 0.028M/s
> > > kretprobe-multi     : 5.703 ± 0.036M/s
> > > kretprobe-fast      : 7.646 ± 0.084M/s
> > > kretprobe-multi-fast: 4.354 ± 0.066M/s
> >
> > That is weird indeed. Since it's a single-CPU run,
> > it cannot be due to contention.
> > My gut feel is that something odd is here:
> >
> >   #ifdef CONFIG_X86_KERNEL_IBT
> >   static unsigned long get_entry_ip(unsigned long fentry_ip)
> >
> > Did you have it in your config?
>
> No, this was without IBT. I do have an upcoming small patch to
> eliminate an unnecessary copy_kernel_nofault() in the IBT path, but I
> don't think this is it. It should be pretty clear with profiling where
> we are losing CPU; I just didn't have the time/desire to look into it,
> as I have tons of small things being worked on in parallel and didn't
> want to add another one.
>
> > > +SEC("raw_tp")
> > > +int trigger_driver(void *ctx)
> > > +{
> > > +	int i;
> > > +
> > > +	for (i = 0; i < batch_iters; i++)
> > > +		(void)bpf_get_smp_processor_id(); /* attach here to benchmark */
> >
> > bpf_get_numa_node_id() is probably even faster and
> > not subject to DEBUG configs.
>
> Sure, I can switch to that, np.