On Tue, Oct 22, 2024 at 3:21 AM Puranjay Mohan <puranjay@xxxxxxxxxx> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes:
>
> > On Mon, Oct 21, 2024 at 5:22 AM Puranjay Mohan <puranjay@xxxxxxxxxx> wrote:
> >>
> >> Add a microbenchmark for the bpf_csum_diff() helper. This benchmark works
> >> by filling a 4KB buffer with random data and calculating the internet
> >> checksum on different parts of this buffer using bpf_csum_diff().
> >>
> >> Example run using ./benchs/run_bench_csum_diff.sh on x86_64:
> >>
> >> [bpf]$ ./benchs/run_bench_csum_diff.sh
> >> 4       2.296 ± 0.066M/s (drops 0.000 ± 0.000M/s)
> >> 8       2.320 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> >> 16      2.315 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> >> 20      2.318 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> >> 32      2.308 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> >> 40      2.300 ± 0.029M/s (drops 0.000 ± 0.000M/s)
> >> 64      2.286 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> >> 128     2.250 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> >> 256     2.173 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> >> 512     2.023 ± 0.055M/s (drops 0.000 ± 0.000M/s)
> >
> > You are not benchmarking bpf_csum_diff(), you are benchmarking how
> > often you can call bpf_prog_test_run(). Add some batching on the BPF
> > side. These numbers tell you that there is no difference between
> > calculating the checksum for 4 bytes and for 512; that didn't seem
> > strange to you?
>
> This didn't seem strange to me because, if you look at the tables I
> added to the cover letter, there is a clear improvement after optimizing
> the helper, and arm64 even shows a linear drop going from 4 bytes to
> 512 bytes, even after the optimization.
>

Regardless of optimization, it's strange that throughput barely differs
when you vary the amount of work by more than 100x. This wouldn't be
strange if this checksum calculation were some sort of cryptographic
hash, where it's intentional to have the same timing regardless of the
amount of work, or something along those lines. But I don't think
that's the case here.

But as it is right now, this benchmark is benchmarking
bpf_prog_test_run(), as I mentioned, which seems to bottleneck at about
2mln/s throughput on your machine. bpf_csum_diff()'s overhead is
trivial compared to bpf_prog_test_run() overhead and syscall/context
switch overhead.

We shouldn't add a benchmark that doesn't benchmark the right thing.
So just add a bpf_for(i, 0, 100) loop doing bpf_csum_diff(), and then
do the atomic increment *after* the loop (to minimize atomics
overhead).

> On x86, after the improvement, 4 bytes and 512 bytes show similar
> numbers, but there is still a small drop that can be seen going from 4
> to 512 bytes.
>
> My thought was that because bpf_csum_diff() calls csum_partial() on
> x86, which is already optimized, most of the overhead was due to
> copying the buffer, which is now removed.
>
> I guess I can amplify the difference between 4B and 512B by calling
> bpf_csum_diff() multiple times in a loop, or by calculating the csum by
> dividing the buffer into more parts (currently the BPF code divides it
> into 2 parts only).
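
Yes, a loop is exactly what I'm suggesting. Roughly something along
these lines on the BPF side (an untested sketch; the section, buffer,
and counter names below are placeholders, not necessarily what your
patch uses):

/* SPDX-License-Identifier: GPL-2.0 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include "bpf_experimental.h" /* bpf_for() */

#define BUF_SZ 4096

/* filled with random data from user space before the run */
char buf[BUF_SZ];

/* bytes to checksum per call, set from user space (4..512);
 * const volatile keeps the size a known constant for the verifier
 */
const volatile u32 chunk_sz = 4;

long hits = 0;

SEC("tc")
int compute_checksum(void *ctx)
{
	int i;

	/* batch enough bpf_csum_diff() work per bpf_prog_test_run()
	 * invocation that the helper, not the syscall and context
	 * switch, dominates the measurement
	 */
	bpf_for(i, 0, 100)
		bpf_csum_diff(NULL, 0, (__be32 *)buf, chunk_sz, 0);

	/* one atomic per program run, *after* the loop */
	__sync_fetch_and_add(&hits, 1);

	return 0;
}

char _license[] SEC("license") = "GPL";

If you want the reported rate to stay in units of bpf_csum_diff()
calls per second, either add 100 to hits here or scale the counter on
the user-space side.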
>
> >>
> >> Signed-off-by: Puranjay Mohan <puranjay@xxxxxxxxxx>
> >> ---
> >>  tools/testing/selftests/bpf/Makefile          |   2 +
> >>  tools/testing/selftests/bpf/bench.c           |   4 +
> >>  .../selftests/bpf/benchs/bench_csum_diff.c    | 164 ++++++++++++++++++
> >>  .../bpf/benchs/run_bench_csum_diff.sh         |  10 ++
> >>  .../selftests/bpf/progs/csum_diff_bench.c     |  25 +++
> >>  5 files changed, 205 insertions(+)
> >>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_csum_diff.c
> >>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_csum_diff.sh
> >>  create mode 100644 tools/testing/selftests/bpf/progs/csum_diff_bench.c
> >>
>
> [...]