Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes:

> On Tue, Oct 22, 2024 at 3:21 AM Puranjay Mohan <puranjay@xxxxxxxxxx> wrote:
>>
>> Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes:
>>
>> > On Mon, Oct 21, 2024 at 5:22 AM Puranjay Mohan <puranjay@xxxxxxxxxx> wrote:
>> >>
>> >> Add a microbenchmark for the bpf_csum_diff() helper. This benchmark
>> >> works by filling a 4KB buffer with random data and calculating the
>> >> internet checksum on different parts of this buffer using
>> >> bpf_csum_diff().
>> >>
>> >> Example run using ./benchs/run_bench_csum_diff.sh on x86_64:
>> >>
>> >> [bpf]$ ./benchs/run_bench_csum_diff.sh
>> >> 4     2.296 ± 0.066M/s (drops 0.000 ± 0.000M/s)
>> >> 8     2.320 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>> >> 16    2.315 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> >> 20    2.318 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> >> 32    2.308 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>> >> 40    2.300 ± 0.029M/s (drops 0.000 ± 0.000M/s)
>> >> 64    2.286 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> >> 128   2.250 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> >> 256   2.173 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> >> 512   2.023 ± 0.055M/s (drops 0.000 ± 0.000M/s)
>> >
>> > You are not benchmarking bpf_csum_diff(), you are benchmarking how
>> > often you can call bpf_prog_test_run(). Add some batching on the BPF
>> > side. These numbers tell you that there is no difference between
>> > calculating the checksum for 4 bytes and for 512 -- didn't that seem
>> > strange to you?
>>
>> This didn't seem strange to me because, as the tables I added to the
>> cover letter show, there is a clear improvement after optimizing the
>> helper, and arm64 even shows a linear drop going from 4 bytes to 512
>> bytes after the optimization.
>>
>
> Regardless of the optimization, it's strange that throughput barely
> differs when you vary the amount of work by more than 100x. This
> wouldn't be strange if the checksum calculation were some sort of
> cryptographic hash, where constant timing regardless of the amount of
> work is intentional, or something along those lines. But I don't
> think that's the case here.
>
> As it stands, this benchmark is benchmarking bpf_prog_test_run(), as
> I mentioned, which seems to bottleneck at about 2M/s throughput on
> your machine. bpf_csum_diff()'s overhead is trivial compared to the
> bpf_prog_test_run() overhead and the syscall/context-switch overhead.
>
> We shouldn't add a benchmark that doesn't benchmark the right thing.
> So just add a bpf_for(i, 0, 100) loop doing bpf_csum_diff(), and then
> do the atomic increment *after* the loop (to minimize the atomics
> overhead).

Thanks, now I understand what you meant. I will add the bpf_for()
loop in the next version.

Thanks,
Puranjay
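P.S. For reference, a rough (and untested) sketch of the batched BPF
program I have in mind. Names like buf, buf_len, and hits are
placeholders rather than the actual bench sources, and bpf_for() is
the open-coded iterator macro from the selftests' bpf_experimental.h:

// SPDX-License-Identifier: GPL-2.0
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include "bpf_experimental.h"

#define BUF_SZ 4096

char buf[BUF_SZ];	/* filled with random data from user space */
__u32 buf_len;		/* bytes to checksum; multiple of 4, <= 512 */
long hits;		/* throughput counter read by the bench harness */

SEC("tc")
int compute_checksum(struct __sk_buff *ctx)
{
	__u32 len = buf_len;
	__s64 csum = 0;
	int i;

	if (len > BUF_SZ)	/* bound the size for the verifier */
		return 0;

	/* Batch 100 helper calls per bpf_prog_test_run() so the
	 * syscall/context-switch cost is amortized 100x.
	 */
	bpf_for(i, 0, 100)
		csum += bpf_csum_diff(NULL, 0, (__be32 *)buf, len, 0);

	/* single atomic increment *after* the loop, as suggested */
	__sync_fetch_and_add(&hits, 1);

	return csum > 0; /* keep csum live so the calls aren't elided */
}

char _license[] SEC("license") = "GPL";

With the 100-call batch, bpf_csum_diff() itself should dominate each
test_run, so varying the buffer size should actually show up in the
numbers.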