Puranjay Mohan <puranjay@xxxxxxxxxx> writes:

> Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes:
>
>> On Tue, Oct 22, 2024 at 3:21 AM Puranjay Mohan <puranjay@xxxxxxxxxx> wrote:
>>>
>>> Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes:
>>>
>>> > On Mon, Oct 21, 2024 at 5:22 AM Puranjay Mohan <puranjay@xxxxxxxxxx> wrote:
>>> >>
>>> >> Add a microbenchmark for bpf_csum_diff() helper. This benchmark works by
>>> >> filling a 4KB buffer with random data and calculating the internet
>>> >> checksum on different parts of this buffer using bpf_csum_diff().
>>> >>
>>> >> Example run using ./benchs/run_bench_csum_diff.sh on x86_64:
>>> >>
>>> >> [bpf]$ ./benchs/run_bench_csum_diff.sh
>>> >> 4      2.296 ± 0.066M/s (drops 0.000 ± 0.000M/s)
>>> >> 8      2.320 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>>> >> 16     2.315 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>>> >> 20     2.318 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>>> >> 32     2.308 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>>> >> 40     2.300 ± 0.029M/s (drops 0.000 ± 0.000M/s)
>>> >> 64     2.286 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>>> >> 128    2.250 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>>> >> 256    2.173 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>>> >> 512    2.023 ± 0.055M/s (drops 0.000 ± 0.000M/s)
>>> >
>>> > you are not benchmarking bpf_csum_diff(), you are benchmarking how
>>> > often you can call bpf_prog_test_run(). Add some batching on the BPF
>>> > side, these numbers tell you that there is no difference between
>>> > calculating checksum for 4 bytes and for 512, that didn't seem strange
>>> > to you?
>>>
>>> This didn't seem strange to me because if you see the tables I added to
>>> the cover letter, there is a clear improvement after optimizing the
>>> helper and arm64 even shows a linear drop going from 4 bytes to 512
>>> bytes, even after the optimization.
>>>
>>
>> Regardless of optimization, it's strange that throughput barely
>> differs when you vary the amount of work by more than 100x. This
>> wouldn't be strange if this checksum calculation was some sort of
>> cryptographic hash, where it's intentional to have the same timing
>> regardless of amount of work, or something along those lines. But I
>> don't think that's the case here.
>>
>> But as it is right now, this benchmark is benchmarking
>> bpf_prog_test_run(), as I mentioned, which seems to be bottlenecking
>> at about 2mln/s throughput for your machine. bpf_csum_diff()'s
>> overhead is trivial compared to bpf_prog_test_run() overhead and
>> syscall/context switch overhead.
>>
>> We shouldn't add the benchmark that doesn't benchmark the right thing.
>> So just add a bpf_for(i, 0, 100) loop doing bpf_csum_diff(), and then
>> do atomic increment *after* the loop (to minimize atomics overhead).
>
> Thanks, now I understand what you meant. Will add the bpf_for() in the
> next version.

I have decided to drop this patch as even after adding bpf_for() the
difference between 4B and 512B is not that much. So, benchmarking
bpf_csum_diff() using this triggering based framework is not useful.

So, v2 will not have this patch but the cover letter will still have the
tables to show the difference before/after the optimization.

Thanks,
Puranjay
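
For readers following the thread, the batching suggested in the quoted review (a bpf_for(i, 0, 100) loop around the checksum call, then a single atomic increment after the loop) can be sketched in plain userspace C. This is only an illustration under assumptions, not code from the patch: inet_csum() is a hypothetical stand-in for bpf_csum_diff(), and hits, BATCH_ITERS and run_batch() are made-up names. Adding BATCH_ITERS to the counter (rather than 1) is also an assumption, chosen so the counter reflects individual checksum calls.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: batch size mirrors the bpf_for(i, 0, 100) suggestion. */
#define BATCH_ITERS 100

/* Hypothetical shared counter that a benchmark harness would read. */
static atomic_ulong hits;

/*
 * Hypothetical stand-in for bpf_csum_diff(): a 16-bit one's-complement
 * internet checksum over big-endian 16-bit words (RFC 1071 style).
 */
static uint16_t inet_csum(const uint8_t *buf, size_t len)
{
	uint64_t sum = 0;
	size_t i;

	/* Sum complete 16-bit words. */
	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];

	/* A trailing odd byte is padded with zero on the right. */
	if (i < len)
		sum += (uint32_t)buf[i] << 8;

	/* Fold carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}

/*
 * One "trigger": do the work BATCH_ITERS times, then touch the shared
 * counter once, so the atomic's cost is amortized over the batch.
 */
static uint16_t run_batch(const uint8_t *buf, size_t len)
{
	/* volatile keeps the compiler from collapsing the repeated calls. */
	volatile uint16_t csum = 0;
	int i;

	for (i = 0; i < BATCH_ITERS; i++)
		csum = inet_csum(buf, len);

	/* Single atomic update after the loop, not one per iteration. */
	atomic_fetch_add_explicit(&hits, BATCH_ITERS, memory_order_relaxed);

	return csum;
}
```

With this shape, the fixed per-trigger cost (bpf_prog_test_run() and the syscall in the real benchmark) is paid once per BATCH_ITERS checksum calls, which is the point of the suggestion. As noted above, even with such a loop the measured difference between 4B and 512B stayed small.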