The bpf_csum_diff() helper currently returns different values on
different architectures because it calls csum_partial(), which is either
implemented by the architecture (x86_64, arm, etc.) or falls back to the
generic implementation in lib/checksum.c (arm64, riscv, etc.).

The implementation in lib/checksum.c returns the folded result, which is
16 bits long, but an architecture-specific implementation can return an
unfolded value that is wider than 16 bits.

The helper uses a per-cpu scratchpad buffer: it copies the data into
this buffer and then computes the csum over it. This can be optimised by
utilising some mathematical properties of csum.

Patch 1 of this series does preparatory work for homogenizing the
helper. Patch 2 makes the changes to the helper itself. The performance
gain can be seen in the tables below, which were generated using the
benchmark built in patch 4:

x86-64:
+-------------+------------------+------------------+-------------+
| Buffer Size |      Before      |      After       | Improvement |
+-------------+------------------+------------------+-------------+
|      4      | 2.296 ± 0.066M/s | 3.415 ± 0.001M/s |   48.73 %   |
|      8      | 2.320 ± 0.003M/s | 3.409 ± 0.003M/s |   46.93 %   |
|     16      | 2.315 ± 0.001M/s | 3.414 ± 0.003M/s |   47.47 %   |
|     20      | 2.318 ± 0.001M/s | 3.416 ± 0.001M/s |   47.36 %   |
|     32      | 2.308 ± 0.003M/s | 3.413 ± 0.003M/s |   47.87 %   |
|     40      | 2.300 ± 0.029M/s | 3.413 ± 0.003M/s |   48.39 %   |
|     64      | 2.286 ± 0.001M/s | 3.410 ± 0.001M/s |   49.16 %   |
|     128     | 2.250 ± 0.001M/s | 3.404 ± 0.001M/s |   51.28 %   |
|     256     | 2.173 ± 0.001M/s | 3.383 ± 0.001M/s |   55.68 %   |
|     512     | 2.023 ± 0.055M/s | 3.340 ± 0.001M/s |   65.10 %   |
+-------------+------------------+------------------+-------------+

ARM64:
+-------------+------------------+------------------+-------------+
| Buffer Size |      Before      |      After       | Improvement |
+-------------+------------------+------------------+-------------+
|      4      | 1.397 ± 0.005M/s | 1.493 ± 0.005M/s |    6.87 %   |
|      8      | 1.402 ± 0.002M/s | 1.489 ± 0.002M/s |    6.20 %   |
|     16      | 1.391 ± 0.001M/s | 1.481 ± 0.001M/s |    6.47 %   |
|     20      | 1.379 ± 0.001M/s | 1.477 ± 0.001M/s |    7.10 %   |
|     32      | 1.358 ± 0.001M/s | 1.469 ± 0.002M/s |    8.17 %   |
|     40      | 1.339 ± 0.001M/s | 1.462 ± 0.002M/s |    9.18 %   |
|     64      | 1.302 ± 0.002M/s | 1.449 ± 0.003M/s |   11.29 %   |
|     128     | 1.214 ± 0.001M/s | 1.443 ± 0.003M/s |   18.86 %   |
|     256     | 1.080 ± 0.001M/s | 1.423 ± 0.001M/s |   31.75 %   |
|     512     | 0.887 ± 0.001M/s | 1.411 ± 0.002M/s |   59.07 %   |
+-------------+------------------+------------------+-------------+

Patch 5 adds a selftest for this helper to verify the results it
produces in multiple modes and edge cases.

Patch 3 reverts a hack that was done to make the selftest pass on all
architectures.

Puranjay Mohan (5):
  net: checksum: move from32to16() to generic header
  bpf: bpf_csum_diff: optimize and homogenize for all archs
  selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier
  selftests/bpf: Add benchmark for bpf_csum_diff() helper
  selftests/bpf: Add a selftest for bpf_csum_diff()

 arch/parisc/lib/checksum.c                    |  13 +-
 include/net/checksum.h                        |   6 +
 lib/checksum.c                                |  11 +-
 net/core/filter.c                             |  37 +-
 tools/testing/selftests/bpf/Makefile          |   2 +
 tools/testing/selftests/bpf/bench.c           |   4 +
 .../selftests/bpf/benchs/bench_csum_diff.c    | 164 +++++++
 .../bpf/benchs/run_bench_csum_diff.sh         |  10 +
 .../selftests/bpf/prog_tests/test_csum_diff.c | 408 ++++++++++++++++++
 .../selftests/bpf/progs/csum_diff_bench.c     |  25 ++
 .../selftests/bpf/progs/csum_diff_test.c      |  42 ++
 .../bpf/progs/verifier_array_access.c         |   3 +-
 12 files changed, 674 insertions(+), 51 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_csum_diff.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_csum_diff.sh
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_csum_diff.c
 create mode 100644 tools/testing/selftests/bpf/progs/csum_diff_bench.c
 create mode 100644 tools/testing/selftests/bpf/progs/csum_diff_test.c

-- 
2.40.1
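
Illustration (not part of the series): the cover letter says the scratchpad
copy can be avoided by using mathematical properties of csum. The user-space
sketch below shows one such property under simplifying assumptions: in
one's-complement arithmetic, summing the inverted "from" words together with
the "to" words gives the same folded 16-bit value as summing "to", summing
"from", and subtracting. All names here (fold16, sum_words,
diff_with_scratch, diff_direct, MAX_WORDS) are hypothetical and this is not
the kernel code from patch 2. Note that 0 and 0xffff both represent zero in
one's complement, so the two variants may differ in that representation for
some inputs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WORDS 128 /* illustrative limit, not a kernel constant */

/* Fold a 32-bit one's-complement accumulator down to 16 bits. */
static uint16_t fold16(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Add two 32-bit values with end-around carry. */
static uint32_t add32(uint32_t a, uint32_t b)
{
	uint64_t acc = (uint64_t)a + b;

	acc = (acc & 0xffffffffu) + (acc >> 32);
	return (uint32_t)acc;
}

/* One's-complement sum of 32-bit words, starting from seed. */
static uint32_t sum_words(const uint32_t *buf, size_t nwords, uint32_t seed)
{
	uint32_t acc = seed;

	for (size_t i = 0; i < nwords; i++)
		acc = add32(acc, buf[i]);
	return acc;
}

/* Scratchpad style: copy ~from followed by to, then sum the copy. */
static uint16_t diff_with_scratch(const uint32_t *from, size_t from_words,
				  const uint32_t *to, size_t to_words,
				  uint32_t seed)
{
	uint32_t scratch[2 * MAX_WORDS];
	size_t n = 0;

	assert(from_words <= MAX_WORDS && to_words <= MAX_WORDS);
	for (size_t i = 0; i < from_words; i++)
		scratch[n++] = ~from[i];
	for (size_t i = 0; i < to_words; i++)
		scratch[n++] = to[i];
	return fold16(sum_words(scratch, n, seed));
}

/* Copy-free style: sum each buffer in place, then subtract (a + ~b). */
static uint16_t diff_direct(const uint32_t *from, size_t from_words,
			    const uint32_t *to, size_t to_words,
			    uint32_t seed)
{
	uint32_t add = sum_words(to, to_words, seed);
	uint32_t sub = sum_words(from, from_words, 0);

	if (from_words == 0)
		return fold16(add);
	return fold16(add32(add, ~sub));
}
```

For example, with from = { 0x00010002 } and to = { 0x00030005 } and a zero
seed, both variants fold to 5, which matches (3 + 5) - (1 + 2) over the
16-bit halves.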