Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes:

> On Tue, Oct 22, 2024 at 3:21 AM Puranjay Mohan <puranjay@xxxxxxxxxx> wrote:
>>
>> Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes:
>>
>> > On Mon, Oct 21, 2024 at 5:22 AM Puranjay Mohan <puranjay@xxxxxxxxxx> wrote:
>> >>
>> >> Add a microbenchmark for the bpf_csum_diff() helper. This benchmark
>> >> works by filling a 4KB buffer with random data and calculating the
>> >> internet checksum on different parts of this buffer using
>> >> bpf_csum_diff().
>> >>
>> >> Example run using ./benchs/run_bench_csum_diff.sh on x86_64:
>> >>
>> >> [bpf]$ ./benchs/run_bench_csum_diff.sh
>> >> 4     2.296 ± 0.066M/s (drops 0.000 ± 0.000M/s)
>> >> 8     2.320 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>> >> 16    2.315 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> >> 20    2.318 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> >> 32    2.308 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>> >> 40    2.300 ± 0.029M/s (drops 0.000 ± 0.000M/s)
>> >> 64    2.286 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> >> 128   2.250 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> >> 256   2.173 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> >> 512   2.023 ± 0.055M/s (drops 0.000 ± 0.000M/s)
>> >
>> > You are not benchmarking bpf_csum_diff(), you are benchmarking how
>> > often you can call bpf_prog_test_run(). Add some batching on the BPF
>> > side. These numbers tell you that there is no difference between
>> > calculating the checksum for 4 bytes and for 512 -- didn't that seem
>> > strange to you?
>>
>> This didn't seem strange to me because, as the tables I added to the
>> cover letter show, there is a clear improvement after optimizing the
>> helper, and arm64 even shows a linear drop going from 4 bytes to 512
>> bytes after the optimization.
>>
>
> Regardless of the optimization, it's strange that throughput barely
> differs when you vary the amount of work by more than 100x. This
> wouldn't be strange if the checksum calculation were some sort of
> cryptographic hash, where constant timing regardless of the amount of
> work is intentional, or something along those lines. But I don't
> think that's the case here.
>
> As it stands, this benchmark is benchmarking bpf_prog_test_run(), as
> I mentioned, which seems to bottleneck at about 2M/s throughput on
> your machine. bpf_csum_diff()'s overhead is trivial compared to the
> bpf_prog_test_run() overhead and the syscall/context-switch overhead.
>
> We shouldn't add a benchmark that doesn't benchmark the right thing.
> So just add a bpf_for(i, 0, 100) loop doing bpf_csum_diff(), and then
> do the atomic increment *after* the loop (to minimize the atomics
> overhead).

Thanks, now I understand what you meant. I will add the bpf_for()
loop in the next version.

Thanks,
Puranjay
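P.S. For reference, a rough (and untested) sketch of the batched BPF
program I have in mind. Names like buf, buf_len, and hits are
placeholders rather than the actual bench sources, and bpf_for() is
the open-coded iterator macro from the selftests' bpf_experimental.h:

// SPDX-License-Identifier: GPL-2.0
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include "bpf_experimental.h"

#define BUF_SZ 4096

char buf[BUF_SZ];	/* filled with random data from user space */
__u32 buf_len;		/* bytes to checksum; multiple of 4, <= 512 */
long hits;		/* throughput counter read by the bench harness */

SEC("tc")
int compute_checksum(struct __sk_buff *ctx)
{
	__u32 len = buf_len;
	__s64 csum = 0;
	int i;

	if (len > BUF_SZ)	/* bound the size for the verifier */
		return 0;

	/* Batch 100 helper calls per bpf_prog_test_run() so the
	 * syscall/context-switch cost is amortized 100x.
	 */
	bpf_for(i, 0, 100)
		csum += bpf_csum_diff(NULL, 0, (__be32 *)buf, len, 0);

	/* single atomic increment *after* the loop, as suggested */
	__sync_fetch_and_add(&hits, 1);

	return csum > 0; /* keep csum live so the calls aren't elided */
}

char _license[] SEC("license") = "GPL";

With the 100-call batch, bpf_csum_diff() itself should dominate each
test_run, so varying the buffer size should actually show up in the
numbers.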