On Sat, May 9, 2020 at 10:24 AM Yonghong Song <yhs@xxxxxx> wrote:
>
>
>
> On 5/8/20 4:20 PM, Andrii Nakryiko wrote:
> > Add fmod_ret BPF program to existing test_overhead selftest. Also re-implement
> > user-space benchmarking part into benchmark runner to compare results. Results
> > with ./bench are consistently somewhat lower than test_overhead's, but relative
> > performance of various types of BPF programs stay consistent (e.g., kretprobe is
> > noticeably slower).
> >
> > run_bench_rename.sh script (in benchs/ directory) was used to produce the
> > following numbers:
> >
> >   base      :    3.975 ± 0.065M/s
> >   kprobe    :    3.268 ± 0.095M/s
> >   kretprobe :    2.496 ± 0.040M/s
> >   rawtp     :    3.899 ± 0.078M/s
> >   fentry    :    3.836 ± 0.049M/s
> >   fexit     :    3.660 ± 0.082M/s
> >   fmodret   :    3.776 ± 0.033M/s
> >
> > While running test_overhead gives:
> >
> >   task_rename base        4457K events per sec
> >   task_rename kprobe      3849K events per sec
> >   task_rename kretprobe   2729K events per sec
> >   task_rename raw_tp      4506K events per sec
> >   task_rename fentry      4381K events per sec
> >   task_rename fexit       4349K events per sec
> >   task_rename fmod_ret    4130K events per sec
>
> Do you know where the overhead is and how we could provide options in
> bench to reduce the overhead so we can achieve similar numbers?
> For benchmarking, sometimes you really want to see "true"
> potential of a particular implementation.

Alright, let's make it an official bench-off... :) And the reason for
this discrepancy turns out to be... not atomics at all! But rather a
single-threaded vs multi-threaded process (well, at least task_rename
happening from a non-main thread, I didn't narrow it down further).
Atomics actually make very little difference, which gives me good
peace of mind :)

So, I've built and run test_overhead (selftest) and bench both as
multi-threaded and single-threaded apps. Corresponding results match
almost perfectly. And that's while test_overhead doesn't use atomics
at all, while bench still does.
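
For anyone who wants to poke at the main-thread vs non-main-thread effect
in isolation, here is a minimal sketch of the comparison. It is not the
code from bench or test_overhead: struct rename_run, rename_loop() and
run_rename() are hypothetical names, and prctl(PR_SET_NAME) is only an
assumed stand-in for whatever the real tools use to rename the task. The
use_atomics flag switches between an atomic and a plain hit counter.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <sys/prctl.h>
#include <time.h>

/* Hypothetical harness state; not taken from bench or test_overhead. */
struct rename_run {
	long iters;              /* number of renames to perform */
	bool use_atomics;        /* count hits with an atomic add or a plain add */
	atomic_long atomic_hits; /* counter used when use_atomics is set */
	long plain_hits;         /* counter used otherwise */
	double secs;             /* wall-clock duration of the loop */
};

static double now_secs(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Rename the calling thread iters times and count each rename. */
static void *rename_loop(void *arg)
{
	struct rename_run *r = arg;
	double start = now_secs();

	for (long i = 0; i < r->iters; i++) {
		/* Assumed stand-in trigger: renames the calling thread. */
		prctl(PR_SET_NAME, "bench_sketch");
		if (r->use_atomics)
			atomic_fetch_add(&r->atomic_hits, 1);
		else
			r->plain_hits++;
	}
	r->secs = now_secs() - start;
	return NULL;
}

/*
 * Run the loop on the calling thread (in_thread == false) or on a
 * freshly spawned thread (in_thread == true). Returns the number of
 * hits counted, or -1 if the thread could not be created.
 */
static long run_rename(struct rename_run *r, bool in_thread)
{
	pthread_t t;

	atomic_store(&r->atomic_hits, 0);
	r->plain_hits = 0;

	if (!in_thread) {
		rename_loop(r);
	} else {
		if (pthread_create(&t, NULL, rename_loop, r))
			return -1;
		pthread_join(t, NULL);
	}
	return r->use_atomics ? atomic_load(&r->atomic_hits) : r->plain_hits;
}
```

Build with -pthread. Dividing the returned hit count by r->secs gives
events per second for that run, so calling run_rename() with each
combination of in_thread and use_atomics gives the four cases discussed
here. Numbers from this sketch are not comparable to the ones in this
thread, since no BPF program is attached.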

Then I also ran test_overhead with added atomics to match the bench
implementation. There are barely any differences, see the last two
sets of results.

BTW, selftest results seem a bit lower than the ones in the original
commit, probably because I made it run more iterations (like 40 times
more) to have more stable results.

So here are the results:

Single-threaded implementations
===============================

/* bench: single-threaded, atomics */
base      :    4.622 ± 0.049M/s
kprobe    :    3.673 ± 0.052M/s
kretprobe :    2.625 ± 0.052M/s
rawtp     :    4.369 ± 0.089M/s
fentry    :    4.201 ± 0.558M/s
fexit     :    4.309 ± 0.148M/s
fmodret   :    4.314 ± 0.203M/s

/* selftest: single-threaded, no atomics */
task_rename base        4555K events per sec
task_rename kprobe      3643K events per sec
task_rename kretprobe   2506K events per sec
task_rename raw_tp      4303K events per sec
task_rename fentry      4307K events per sec
task_rename fexit       4010K events per sec
task_rename fmod_ret    3984K events per sec

Multi-threaded implementations
==============================

/* bench: multi-threaded w/ atomics */
base      :    3.910 ± 0.023M/s
kprobe    :    3.048 ± 0.037M/s
kretprobe :    2.300 ± 0.015M/s
rawtp     :    3.687 ± 0.034M/s
fentry    :    3.740 ± 0.087M/s
fexit     :    3.510 ± 0.009M/s
fmodret   :    3.485 ± 0.050M/s

/* selftest: multi-threaded w/ atomics */
task_rename base        3872K events per sec
task_rename kprobe      3068K events per sec
task_rename kretprobe   2350K events per sec
task_rename raw_tp      3731K events per sec
task_rename fentry      3639K events per sec
task_rename fexit       3558K events per sec
task_rename fmod_ret    3511K events per sec

/* selftest: multi-threaded, no atomics */
task_rename base        3945K events per sec
task_rename kprobe      3298K events per sec
task_rename kretprobe   2451K events per sec
task_rename raw_tp      3718K events per sec
task_rename fentry      3782K events per sec
task_rename fexit       3543K events per sec
task_rename fmod_ret    3526K events per sec

>
> >
> > Acked-by: John Fastabend <john.fastabend@xxxxxxxxx>
> > Signed-off-by: Andrii Nakryiko <andriin@xxxxxx>
> > ---
> >   tools/testing/selftests/bpf/Makefile          |   4 +-
> >   tools/testing/selftests/bpf/bench.c           |  14 ++
> >   .../selftests/bpf/benchs/bench_rename.c       | 195 ++++++++++++++++++
> >   .../selftests/bpf/benchs/run_bench_rename.sh  |   9 +
> >   .../selftests/bpf/prog_tests/test_overhead.c  |  14 +-
> >   .../selftests/bpf/progs/test_overhead.c       |   6 +
> >   6 files changed, 240 insertions(+), 2 deletions(-)
> >   create mode 100644 tools/testing/selftests/bpf/benchs/bench_rename.c
> >   create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_rename.sh
> >
> > diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> > index 289fffbf975e..29a02abf81a3 100644
> > --- a/tools/testing/selftests/bpf/Makefile
> > +++ b/tools/testing/selftests/bpf/Makefile
> > @@ -409,10 +409,12 @@ $(OUTPUT)/test_cpp: test_cpp.cpp $(OUTPUT)/test_core_extern.skel.h $(BPFOBJ)
> >   $(OUTPUT)/bench_%.o: benchs/bench_%.c bench.h
> >   	$(call msg,CC,,$@)
> >   	$(CC) $(CFLAGS) -c $(filter %.c,$^) $(LDLIBS) -o $@
> > +$(OUTPUT)/bench_rename.o: $(OUTPUT)/test_overhead.skel.h
> >   $(OUTPUT)/bench.o: bench.h
> >   $(OUTPUT)/bench: LDLIBS += -lm
> >   $(OUTPUT)/bench: $(OUTPUT)/bench.o \
> > -		  $(OUTPUT)/bench_count.o
> > +		  $(OUTPUT)/bench_count.o \
> > +		  $(OUTPUT)/bench_rename.o
> >   	$(call msg,BINARY,,$@)
> >   	$(CC) $(LDFLAGS) -o $@ $(filter %.a %.o,$^) $(LDLIBS)
> >
>
> [...]