Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes:

> On Mon, Oct 21, 2024 at 5:22 AM Puranjay Mohan <puranjay@xxxxxxxxxx> wrote:
>>
>> Add a microbenchmark for bpf_csum_diff() helper. This benchmark works by
>> filling a 4KB buffer with random data and calculating the internet
>> checksum on different parts of this buffer using bpf_csum_diff().
>>
>> Example run using ./benchs/run_bench_csum_diff.sh on x86_64:
>>
>> [bpf]$ ./benchs/run_bench_csum_diff.sh
>>   4      2.296 ± 0.066M/s (drops 0.000 ± 0.000M/s)
>>   8      2.320 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>>  16      2.315 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>>  20      2.318 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>>  32      2.308 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>>  40      2.300 ± 0.029M/s (drops 0.000 ± 0.000M/s)
>>  64      2.286 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> 128      2.250 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> 256      2.173 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> 512      2.023 ± 0.055M/s (drops 0.000 ± 0.000M/s)
>
> you are not benchmarking bpf_csum_diff(), you are benchmarking how
> often you can call bpf_prog_test_run(). Add some batching on the BPF
> side, these numbers tell you that there is no difference between
> calculating checksum for 4 bytes and for 512, that didn't seem strange
> to you?

This didn't seem strange to me because the tables I added to the cover
letter show a clear improvement after optimizing the helper, and arm64
even shows a linear drop going from 4 bytes to 512 bytes, even after the
optimization. On x86, 4 bytes and 512 bytes show similar numbers after
the improvement, but a small drop can still be seen going from 4 to 512
bytes. My thought was that because bpf_csum_diff() calls csum_partial()
on x86, which is already optimized, most of the overhead was due to
copying the buffer, which is now removed.
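Batching on the BPF side, as Andrii suggests, could look roughly like the sketch below. This is hypothetical kernel-side code, not the actual benchmark program: the iteration count, section type, and variable names are all illustrative assumptions.

```c
/* Hypothetical BPF-side batching sketch -- not the actual benchmark code. */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

#define ITERATIONS 500 /* illustrative batch size, an assumption */

const volatile char buff[4096];
const volatile __u32 to_csum_len = 512;
__u64 result;

SEC("tc")
int compute_checksum(void *ctx)
{
	int i;

	/* Repeating the helper call amortizes the fixed per-run cost of
	 * bpf_prog_test_run(), so the dependence on buffer size becomes
	 * visible in the measurements instead of being drowned out. */
	for (i = 0; i < ITERATIONS; i++)
		result = bpf_csum_diff(NULL, 0, (__be32 *)buff, to_csum_len, 0);

	return 0;
}

char _license[] SEC("license") = "GPL";
```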
I guess I can amplify the difference between 4B and 512B by calling
bpf_csum_diff() multiple times in a loop, or by calculating the csum by
dividing the buffer into more parts (currently the BPF code divides it
into only 2 parts).

>>
>> Signed-off-by: Puranjay Mohan <puranjay@xxxxxxxxxx>
>> ---
>>  tools/testing/selftests/bpf/Makefile          |   2 +
>>  tools/testing/selftests/bpf/bench.c           |   4 +
>>  .../selftests/bpf/benchs/bench_csum_diff.c    | 164 ++++++++++++++++++
>>  .../bpf/benchs/run_bench_csum_diff.sh         |  10 ++
>>  .../selftests/bpf/progs/csum_diff_bench.c     |  25 +++
>>  5 files changed, 205 insertions(+)
>>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_csum_diff.c
>>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_csum_diff.sh
>>  create mode 100644 tools/testing/selftests/bpf/progs/csum_diff_bench.c
>>
>
> [...]
>
>> +
>> +static void csum_diff_setup(void)
>> +{
>> +	int err;
>> +	char *buff;
>> +	size_t i, sz;
>> +
>> +	sz = sizeof(ctx.skel->rodata->buff);
>> +
>> +	setup_libbpf();
>> +
>> +	ctx.skel = csum_diff_bench__open();
>> +	if (!ctx.skel) {
>> +		fprintf(stderr, "failed to open skeleton\n");
>> +		exit(1);
>> +	}
>> +
>> +	srandom(time(NULL));
>> +	buff = ctx.skel->rodata->buff;
>> +
>> +	/*
>> +	 * Set the first 8 bytes of the buffer to 0xdeadbeefdeadbeef; this is
>> +	 * later used to verify the correctness of the helper by checking that
>> +	 * the checksum result for 0xdeadbeefdeadbeef is 0x3b3b.
>> +	 */
>> +	*(u64 *)buff = 0xdeadbeefdeadbeef;
>> +
>> +	for (i = 8; i < sz; i++)
>> +		buff[i] = '1' + random() % 9;
>
> so, you only generate 9 different values for bytes, why? Why not full
> byte range?

Thanks for catching this; there is no reason to restrict the bytes to
the characters '1' through '9'. I will use the full byte range in the
next version.

Thanks,
Puranjay