On Mon, Oct 21, 2024 at 5:22 AM Puranjay Mohan <puranjay@xxxxxxxxxx> wrote:
>
> Add a microbenchmark for bpf_csum_diff() helper. This benchmark works by
> filling a 4KB buffer with random data and calculating the internet
> checksum on different parts of this buffer using bpf_csum_diff().
>
> Example run using ./benchs/run_bench_csum_diff.sh on x86_64:
>
> [bpf]$ ./benchs/run_bench_csum_diff.sh
> 4      2.296 ± 0.066M/s (drops 0.000 ± 0.000M/s)
> 8      2.320 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> 16     2.315 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> 20     2.318 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> 32     2.308 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> 40     2.300 ± 0.029M/s (drops 0.000 ± 0.000M/s)
> 64     2.286 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> 128    2.250 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> 256    2.173 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> 512    2.023 ± 0.055M/s (drops 0.000 ± 0.000M/s)

You are not benchmarking bpf_csum_diff(); you are benchmarking how often
you can call bpf_prog_test_run(). Add some batching on the BPF side. These
numbers say there is almost no difference between calculating a checksum
for 4 bytes and for 512 bytes (2.296M/s vs 2.023M/s). Didn't that seem
strange to you?

pw-bot: cr

>
> Signed-off-by: Puranjay Mohan <puranjay@xxxxxxxxxx>
> ---
>  tools/testing/selftests/bpf/Makefile | 2 +
>  tools/testing/selftests/bpf/bench.c | 4 +
>  .../selftests/bpf/benchs/bench_csum_diff.c | 164 ++++++++++++++++++
>  .../bpf/benchs/run_bench_csum_diff.sh | 10 ++
>  .../selftests/bpf/progs/csum_diff_bench.c | 25 +++
>  5 files changed, 205 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_csum_diff.c
>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_csum_diff.sh
>  create mode 100644 tools/testing/selftests/bpf/progs/csum_diff_bench.c
>

[...]

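To illustrate the batching point, here is a minimal userspace model of the idea, not the BPF program itself. It assumes that one program invocation loops over the checksum call many times, so the fixed cost of bpf_prog_test_run() is spread across the whole batch. The names csum_ref(), run_batch() and BATCH_ITERS are hypothetical; in the real program the loop body would call bpf_csum_diff() rather than this stand-in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical batch size; the patch does not define one. */
#define BATCH_ITERS 1000

/*
 * Userspace stand-in for the helper call under test: a folded 16-bit
 * one's-complement sum over len bytes. len must be even and small enough
 * (e.g. <= 4096) that the 32-bit accumulator cannot overflow.
 */
static uint16_t csum_ref(const unsigned char *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(len % 2 == 0);
	for (i = 0; i < len; i += 2) {
		uint16_t w;

		memcpy(&w, buf + i, sizeof(w));
		sum += w;
	}
	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/*
 * Models one program invocation that performs BATCH_ITERS checksum calls.
 * Returns the number of calls made; that is the value the benchmark would
 * count, instead of counting one hit per invocation. The result is written
 * through a volatile pointer so the compiler cannot drop the loop.
 */
static long run_batch(const unsigned char *buf, size_t len,
		      volatile uint16_t *result)
{
	long i;

	for (i = 0; i < BATCH_ITERS; i++)
		*result = csum_ref(buf, len);
	return i;
}
```

With a loop like this, the reported rate would be driven by the time spent checksumming, so the 4-byte and 512-byte rows should separate clearly.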
> +
> +static void csum_diff_setup(void)
> +{
> +	int err;
> +	char *buff;
> +	size_t i, sz;
> +
> +	sz = sizeof(ctx.skel->rodata->buff);
> +
> +	setup_libbpf();
> +
> +	ctx.skel = csum_diff_bench__open();
> +	if (!ctx.skel) {
> +		fprintf(stderr, "failed to open skeleton\n");
> +		exit(1);
> +	}
> +
> +	srandom(time(NULL));
> +	buff = ctx.skel->rodata->buff;
> +
> +	/*
> +	 * Set first 8 bytes of buffer to 0xdeadbeefdeadbeef, this is later used to verify the
> +	 * correctness of the helper by comparing the checksum result for 0xdeadbeefdeadbeef that
> +	 * should be 0x3b3b
> +	 */
> +
> +	*(u64 *)buff = 0xdeadbeefdeadbeef;
> +
> +	for (i = 8; i < sz; i++)
> +		buff[i] = '1' + random() % 9;

So you only generate 9 different byte values ('1' through '9'). Why?
Why not the full byte range?

> +
> +	ctx.skel->rodata->buff_len = args.buff_len;
> +
> +	err = csum_diff_bench__load(ctx.skel);
> +	if (err) {
> +		fprintf(stderr, "failed to load skeleton\n");
> +		csum_diff_bench__destroy(ctx.skel);
> +		exit(1);
> +	}
> +}
> +

[...]
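As a sketch of the full-byte-range fill asked about above: the fragment below keeps the 8-byte 0xdeadbeefdeadbeef prefix and draws every remaining byte from 0..255. fill_buf() and MAGIC_LEN are hypothetical names, and rand() is used only to keep the sketch within standard C; the patch's random() would serve the same purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Length of the known-value prefix that the random fill must not touch. */
#define MAGIC_LEN 8

/*
 * Write the 8-byte magic value at the start of buf, then fill the rest
 * with bytes covering the full 0..255 range. sz must be >= MAGIC_LEN.
 */
static void fill_buf(unsigned char *buf, size_t sz)
{
	const uint64_t magic = 0xdeadbeefdeadbeefULL;
	size_t i;

	assert(sz >= MAGIC_LEN);
	memcpy(buf, &magic, MAGIC_LEN);
	for (i = MAGIC_LEN; i < sz; i++)
		buf[i] = (unsigned char)(rand() & 0xff);
}
```

Masking with 0xff instead of computing '1' + random() % 9 makes every byte value possible, so the input is not restricted to ASCII digits.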