On Fri, Mar 15, 2024 at 2:10 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
>
> On Thu, Mar 14, 2024 at 10:18:13PM -0700, Andrii Nakryiko wrote:
> > Existing kprobe/fentry triggering benchmarks have a 1-to-1 mapping
> > between one syscall execution and one BPF program run. While we use a
> > fast get_pgid() syscall, syscall overhead can still be non-trivial.
> >
> > This patch adds a kprobe/fentry set of benchmarks that significantly
> > amortizes the cost of the syscall vs the actual BPF triggering
> > overhead. We do this by employing the BPF_PROG_TEST_RUN command to
> > trigger a "driver" raw_tp program which runs a tight parameterized
> > loop calling a cheap BPF helper (bpf_get_smp_processor_id()), to which
> > kprobe/fentry programs are attached for benchmarking.
> >
> > This way 1 bpf() syscall causes N executions of the BPF program being
> > benchmarked. N defaults to 100, but can be adjusted with the
> > --trig-batch-iters CLI argument.
> >
> > Results speak for themselves:
> >
> > $ ./run_bench_trigger.sh
> > uprobe-base         : 138.054 ± 0.556M/s
> > base                :  16.650 ± 0.123M/s
> > tp                  :  11.068 ± 0.100M/s
> > rawtp               :  14.087 ± 0.511M/s
> > kprobe              :   9.641 ± 0.027M/s
> > kprobe-multi        :  10.263 ± 0.061M/s
> > kretprobe           :   5.475 ± 0.028M/s
> > kretprobe-multi     :   5.703 ± 0.036M/s
> > fentry              :  14.544 ± 0.112M/s
> > fexit               :  10.637 ± 0.073M/s
> > fmodret             :  11.357 ± 0.061M/s
> > kprobe-fast         :  14.286 ± 0.377M/s
> > kprobe-multi-fast   :  14.999 ± 0.204M/s
> > kretprobe-fast      :   7.646 ± 0.084M/s
> > kretprobe-multi-fast:   4.354 ± 0.066M/s
> > fentry-fast         :  31.475 ± 0.254M/s
> > fexit-fast          :  17.379 ± 0.195M/s
> >
> > Note how the xxx-fast variants measure significantly higher
> > throughput, even though the in-kernel overhead is exactly the same:
> >
> > fentry              :  14.544 ± 0.112M/s
> > fentry-fast         :  31.475 ± 0.254M/s
> >
> > kprobe-multi        :  10.263 ± 0.061M/s
> > kprobe-multi-fast   :  14.999 ± 0.204M/s
> >
> > One huge and not yet explained deviation is a slowdown of
> > kretprobe-multi; we should look into that separately.
> >
> > kretprobe           :   5.475 ± 0.028M/s
> > kretprobe-multi     :   5.703 ± 0.036M/s
> > kretprobe-fast      :   7.646 ± 0.084M/s
> > kretprobe-multi-fast:   4.354 ± 0.066M/s
> >
> > Kprobe cases don't seem to have this illogical slowdown:
> >
> > kprobe              :   9.641 ± 0.027M/s
> > kprobe-multi        :  10.263 ± 0.061M/s
> > kprobe-fast         :  14.286 ± 0.377M/s
> > kprobe-multi-fast   :  14.999 ± 0.204M/s
>
> hum, I see that as well:
>
> uprobe-base         : 230.624 ± 0.527M/s
> base                :  16.320 ± 0.087M/s
> tp                  :  10.057 ± 0.122M/s
> rawtp               :  14.851 ± 0.300M/s
> kprobe              :  10.993 ± 0.104M/s
> kprobe-multi        :  11.053 ± 0.038M/s
> kretprobe           :   6.679 ± 0.015M/s
> kretprobe-multi     :   6.466 ± 0.015M/s
> fentry              :  14.949 ± 0.064M/s
> fexit               :  10.530 ± 1.275M/s
> fmodret             :  11.145 ± 0.245M/s
> kprobe-fast         :  20.080 ± 0.468M/s
> kprobe-multi-fast   :  17.603 ± 0.102M/s
> kretprobe-fast      :   9.943 ± 0.056M/s
> kretprobe-multi-fast:   5.185 ± 0.022M/s
> fentry-fast         :  46.697 ± 0.260M/s
> fexit-fast          :  19.250 ± 0.108M/s
>
> I even see a decline in kprobe-multi-fast:
>
> kprobe-fast         :  20.080 ± 0.468M/s
> kprobe-multi-fast   :  17.603 ± 0.102M/s
>
> kretprobe-fast      :   9.943 ± 0.056M/s
> kretprobe-multi-fast:   5.185 ± 0.022M/s
>
> I've got some IBT-related code showing up in the perf profile, and
> when I disabled it I got better results for kprobe-multi-fast, but
> kretprobe-multi-fast is still bad

ok, I think it's a self-inflicted false alarm. Note how this benchmark
is using bpf_get_smp_processor_id() as an attach point? Well, guess
what, each triggered program then calls bpf_get_smp_processor_id()
again to get the per-CPU counter to increment. So this benchmark was
effectively stress-testing recursion protection, and I guess kretprobe
hurts the most.
I realized that when I moved to bpf_get_numa_node_id() and saw much
higher numbers:

BEFORE
======
$ benchs/run_bench_trigger.sh kprobe-batch fentry-batch
kprobe-batch : 14.018 ± 0.333M/s
fentry-batch : 31.470 ± 0.791M/s

AFTER
=====
$ benchs/run_bench_trigger.sh kprobe-batch fentry-batch
kprobe-batch : 19.391 ± 0.868M/s
fentry-batch : 48.785 ± 0.932M/s

That's when I realized something was off here. Now the results make a
bit more sense (the machine on which I test is different, so absolute
values differ from before):

$ benchs/run_bench_trigger.sh kprobe kprobe-multi kprobe-batch \
	kprobe-multi-batch kretprobe kretprobe-multi kretprobe-batch \
	kretprobe-multi-batch
kprobe               :  8.114 ± 0.199M/s
kprobe-multi         :  8.739 ± 0.052M/s
kprobe-batch         : 19.379 ± 0.709M/s
kprobe-multi-batch   : 24.969 ± 0.226M/s
kretprobe            :  4.859 ± 0.027M/s
kretprobe-multi      :  4.940 ± 0.083M/s
kretprobe-batch      :  8.460 ± 0.044M/s
kretprobe-multi-batch:  8.761 ± 0.282M/s

>
> uprobe-base         : 234.024 ± 0.225M/s
> base                :  16.383 ± 0.029M/s
> tp                  :   9.973 ± 0.017M/s
> rawtp               :  14.889 ± 0.047M/s
> kprobe              :  10.970 ± 0.011M/s
> kprobe-multi        :  11.640 ± 0.009M/s
> kretprobe           :   6.667 ± 0.005M/s
> kretprobe-multi     :   6.704 ± 0.005M/s
> fentry              :  14.968 ± 0.024M/s
> fexit               :  11.860 ± 0.012M/s
> fmodret             :  12.656 ± 0.036M/s
> kprobe-fast         :  20.340 ± 0.043M/s
> kprobe-multi-fast   :  21.203 ± 0.019M/s
> kretprobe-fast      :   9.956 ± 0.021M/s
> kretprobe-multi-fast:   5.611 ± 0.006M/s
> fentry-fast         :  46.825 ± 0.041M/s
> fexit-fast          :  19.746 ± 0.024M/s
>
> slightly better:
>
> kprobe-fast         :  20.340 ± 0.043M/s
> kprobe-multi-fast   :  21.203 ± 0.019M/s
>
> still almost half the perf:
>
> kretprobe-fast      :   9.956 ± 0.021M/s
> kretprobe-multi-fast:   5.611 ± 0.006M/s
>
> jirka
>
> > Cc: Jiri Olsa <jolsa@xxxxxxxxxx>
> > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> > ---
> >  tools/testing/selftests/bpf/bench.c           |  18 +++
> >  .../selftests/bpf/benchs/bench_trigger.c      | 123 +++++++++++++++++-
> >  .../selftests/bpf/benchs/run_bench_trigger.sh |   8 +-
> >  .../selftests/bpf/progs/trigger_bench.c      |  56 +++++++-
> >  4 files changed, 201 insertions(+), 4 deletions(-)

[...]