On 7/12/24 1:16 PM, Alexei Starovoitov wrote:
On Thu, Jul 11, 2024 at 9:42 AM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
It is clear that the main overhead is the push/pop r9 for
three calls.
Five runs of the benchmarks:
[root@arch-fb-vm1 bpf]# ./benchs/run_bench_private_stack.sh
no-private-stack: 0.662 ± 0.019M/s (drops 0.000 ± 0.000M/s)
private-stack: 0.673 ± 0.017M/s (drops 0.000 ± 0.000M/s)
[root@arch-fb-vm1 bpf]# ./benchs/run_bench_private_stack.sh
no-private-stack: 0.684 ± 0.005M/s (drops 0.000 ± 0.000M/s)
private-stack: 0.676 ± 0.008M/s (drops 0.000 ± 0.000M/s)
[root@arch-fb-vm1 bpf]# ./benchs/run_bench_private_stack.sh
no-private-stack: 0.673 ± 0.017M/s (drops 0.000 ± 0.000M/s)
private-stack: 0.683 ± 0.006M/s (drops 0.000 ± 0.000M/s)
[root@arch-fb-vm1 bpf]# ./benchs/run_bench_private_stack.sh
no-private-stack: 0.680 ± 0.011M/s (drops 0.000 ± 0.000M/s)
private-stack: 0.626 ± 0.050M/s (drops 0.000 ± 0.000M/s)
[root@arch-fb-vm1 bpf]# ./benchs/run_bench_private_stack.sh
no-private-stack: 0.686 ± 0.007M/s (drops 0.000 ± 0.000M/s)
private-stack: 0.683 ± 0.003M/s (drops 0.000 ± 0.000M/s)
The performance is very similar between private-stack and no-private-stack.
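As a rough cross-check of the numbers above, the five runs can be averaged per configuration. This is only a minimal sketch: the throughput values are copied from the output above, while the variable names are illustrative and not part of the benchmark scripts.

```python
from statistics import mean

# Throughput in M/s, copied from the five runs quoted above.
no_private_stack = [0.662, 0.684, 0.673, 0.680, 0.686]
private_stack = [0.673, 0.676, 0.683, 0.626, 0.683]

mean_no_ps = mean(no_private_stack)
mean_ps = mean(private_stack)

# Relative difference of private-stack versus no-private-stack, in percent.
rel_diff_pct = (mean_ps - mean_no_ps) / mean_no_ps * 100

print(f"no-private-stack mean: {mean_no_ps:.3f} M/s")
print(f"private-stack mean:    {mean_ps:.3f} M/s")
print(f"relative difference:   {rel_diff_pct:+.2f}%")
```

The averages differ by roughly one percent, which is smaller than the largest per-run spread reported above (± 0.050M/s), so these numbers alone cannot say where the time is being spent.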
I'm not so sure.
What is the "perf report" before/after?
Are you sure that bench spends enough time inside the program itself?
By the look of it it seems that most of the time will be in hashmap
and syscall overhead.
You need the batched variant, which uses a for loop and is attached to a helper.
See commit 7df4e597ea2c ("selftests/bpf: add batched, mostly in-kernel
BPF triggering benchmarks")
Okay, I see. The current approach is one trigger, one prog run, where
each prog run exercises 3 syscalls. I should add a loop to the bpf
program so that the bpf program itself accounts for the majority of
the run time. Will do this in the next revision, plus running
'perf report'.
I think the next version doesn't need RFC tag. patch 1 lgtm.