Re: [RFC PATCH bpf-next v2 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Fri, 12 Jul 2024 14:47:49 -0700



On Fri, Jul 12, 2024 at 1:48 PM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
>
>
> On 7/12/24 1:16 PM, Alexei Starovoitov wrote:
> > On Thu, Jul 11, 2024 at 9:42 AM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
> >>
> >> It is clear that the main overhead is the push/pop r9 for
> >> three calls.
> >>
> >> Five runs of the benchmarks:
> >>
> >> [root@arch-fb-vm1 bpf]# ./benchs/run_bench_private_stack.sh
> >> no-private-stack:    0.662 ± 0.019M/s (drops 0.000 ± 0.000M/s)
> >> private-stack:       0.673 ± 0.017M/s (drops 0.000 ± 0.000M/s)
> >> [root@arch-fb-vm1 bpf]# ./benchs/run_bench_private_stack.sh
> >> no-private-stack:    0.684 ± 0.005M/s (drops 0.000 ± 0.000M/s)
> >> private-stack:       0.676 ± 0.008M/s (drops 0.000 ± 0.000M/s)
> >> [root@arch-fb-vm1 bpf]# ./benchs/run_bench_private_stack.sh
> >> no-private-stack:    0.673 ± 0.017M/s (drops 0.000 ± 0.000M/s)
> >> private-stack:       0.683 ± 0.006M/s (drops 0.000 ± 0.000M/s)
> >> [root@arch-fb-vm1 bpf]# ./benchs/run_bench_private_stack.sh
> >> no-private-stack:    0.680 ± 0.011M/s (drops 0.000 ± 0.000M/s)
> >> private-stack:       0.626 ± 0.050M/s (drops 0.000 ± 0.000M/s)
> >> [root@arch-fb-vm1 bpf]# ./benchs/run_bench_private_stack.sh
> >> no-private-stack:    0.686 ± 0.007M/s (drops 0.000 ± 0.000M/s)
> >> private-stack:       0.683 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> >>
> >> The performance is very similar between private-stack and no-private-stack.
> > I'm not so sure.
> > What is the "perf report" before/after?
> > Are you sure that bench spends enough time inside the program itself?
> > By the look of it it seems that most of the time will be in hashmap
> > and syscall overhead.
> >
> > You need the batched one that uses a for loop and is attached to a helper.
> > See commit 7df4e597ea2c ("selftests/bpf: add batched, mostly in-kernel
> > BPF triggering benchmarks")
>
> Okay, I see. The current approach is one trigger, one prog run, where
> each prog run exercises 3 syscalls. I should add a loop to the bpf
> program so that the bpf program itself accounts for the majority of the
> time. Will do this in the next revision, plus running 'perf report'.

Please also benchmark on real hardware; a VM will not give reliable results.
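
To illustrate how noisy the VM numbers are, here is a rough sketch that
aggregates the five runs quoted above. The throughput values are copied from
the quoted output; the helper name `summarize` and the "difference smaller
than the run-to-run standard deviation" check are my own assumptions for
illustration, not part of the bench tooling and not a rigorous significance
test.

```python
from statistics import mean, stdev

# Per-run throughput in M/s, copied from the five quoted runs.
no_private_stack = [0.662, 0.684, 0.673, 0.680, 0.686]
private_stack = [0.673, 0.676, 0.683, 0.626, 0.683]


def summarize(samples):
    """Return (mean, sample standard deviation) across runs."""
    return mean(samples), stdev(samples)


base_mean, base_sd = summarize(no_private_stack)
priv_mean, priv_sd = summarize(private_stack)

# Difference between the two configurations, averaged over the five runs.
delta = priv_mean - base_mean

# Crude noise check (an assumption of this sketch): treat the difference as
# indistinguishable from noise if it is smaller than the larger of the two
# run-to-run standard deviations.
within_noise = abs(delta) < max(base_sd, priv_sd)

print(f"no-private-stack: {base_mean:.3f} +/- {base_sd:.3f} M/s")
print(f"private-stack:    {priv_mean:.3f} +/- {priv_sd:.3f} M/s")
print(f"delta: {delta:+.4f} M/s, within run-to-run noise: {within_noise}")
```

With only five samples per configuration, and one private-stack run far below
the others, this cannot separate a small real overhead from measurement
jitter, which is why numbers from real hardware would be more convincing.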

>
> >
> > I think the next version doesn't need RFC tag. patch 1 lgtm.
>