Each architecture generally implements fine-tuned checksum functions to
leverage the instruction set. This patch adds the main checksum
functions that are used in networking. Tested on QEMU, this series
allows the CHECKSUM_KUNIT tests to complete an average of 50.9% faster.

This patch makes heavy use of the Zbb extension using alternatives
patching.

To test this patch, enable the configs for KUNIT, then CHECKSUM_KUNIT.

I have attempted to make these functions as optimal as possible, but I
have not run anything on actual riscv hardware. My performance testing
has been limited to inspecting the assembly, running the algorithms on
x86 hardware, and running in QEMU.

ip_fast_csum is a relatively small function, so even though it is
possible to read 64 bits at a time on compatible hardware, the
bottleneck becomes the cleanup and setup code, and loading 32 bits at a
time is actually faster.

Relies on https://lore.kernel.org/lkml/20230920193801.3035093-1-evan@xxxxxxxxxxxx/

---

The algorithm proposed to replace the default csum_fold can be seen to
compute the same result by running all 2^32 possible inputs.

#include <stdio.h>

/* Rotate a 32-bit word right by shift bits. */
static inline unsigned int ror32(unsigned int word, unsigned int shift)
{
	return (word >> (shift & 31)) | (word << ((-shift) & 31));
}

/* Default fold: add the 16-bit halves twice, then complement. */
unsigned short csum_fold(unsigned int csum)
{
	unsigned int sum = csum;
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return ~sum;
}

/* Proposed fold, based on the version used in arch/arc. */
unsigned short csum_fold_arc(unsigned int csum)
{
	return ((~csum - ror32(csum, 16)) >> 16);
}

int main(void)
{
	unsigned int start = 0x0;

	/* Walk every 32-bit input; the loop ends when start wraps to 0. */
	do {
		if (csum_fold(start) != csum_fold_arc(start)) {
			printf("Not the same %u\n", start);
			return -1;
		}
		start += 1;
	} while (start != 0x0);
	printf("The same\n");
	return 0;
}

Cc: Paul Walmsley <paul.walmsley@xxxxxxxxxx>
Cc: Albert Ou <aou@xxxxxxxxxxxxxxxxx>
Cc: Arnd Bergmann <arnd@xxxxxxxx>
To: Charlie Jenkins <charlie@xxxxxxxxxxxx>
To: Palmer Dabbelt <palmer@xxxxxxxxxxx>
To: Conor Dooley <conor@xxxxxxxxxx>
To: Samuel Holland <samuel.holland@xxxxxxxxxx>
To: David Laight <David.Laight@xxxxxxxxxx>
To: Xiao Wang <xiao.w.wang@xxxxxxxxx>
To: Evan Green <evan@xxxxxxxxxxxx>
To: Guo Ren <guoren@xxxxxxxxxx>
To: linux-riscv@xxxxxxxxxxxxxxxxxxx
To: linux-kernel@xxxxxxxxxxxxxxx
To: linux-arch@xxxxxxxxxxxxxxx
Signed-off-by: Charlie Jenkins <charlie@xxxxxxxxxxxx>

---
Changes in v15:
- Create modify_unaligned_access_branches to consolidate duplicate code
  (Evan)
- Link to v14: https://lore.kernel.org/r/20231227-optimize_checksum-v14-0-ddfd48016566@xxxxxxxxxxxx

Changes in v14:
- Update misaligned static branch when CPUs are hotplugged (Guo)
- Leave off Evan's reviewed-by on patch 2 since it was completely
  re-written
- Link to v13: https://lore.kernel.org/r/20231220-optimize_checksum-v13-0-a73547e1cad8@xxxxxxxxxxxx

Changes in v13:
- Move cast from patch 4 to patch 3
- Link to v12: https://lore.kernel.org/r/20231212-optimize_checksum-v12-0-419a4ba6d666@xxxxxxxxxxxx

Changes in v12:
- Rebase onto 6.7-rc5
- Add performance stats in the cover letter
- Link to v11: https://lore.kernel.org/r/20231117-optimize_checksum-v11-0-7d9d954fe361@xxxxxxxxxxxx

Changes in v11:
- Extensive modifications to comply with sparse
- Organize include statements (Xiao)
- Add csum_ipv6_magic to commit message (Xiao)
- Remove extraneous len statement (Xiao)
- Add kasan_check_read call (Xiao)
- Improve comment field in checksum.h (Xiao)
- Consolidate "buff" and "len" into one parameter "end" (Xiao)
- Link to v10: https://lore.kernel.org/r/20231101-optimize_checksum-v10-0-a498577bb969@xxxxxxxxxxxx

Changes in v10:
- Move tests that were riscv-specific to be arch agnostic (Arnd)
- Link to v9: https://lore.kernel.org/r/20231031-optimize_checksum-v9-0-ea018e69b229@xxxxxxxxxxxx

Changes in v9:
- Use ror64 (Xiao)
- Move do_csum and csum_ipv6_magic headers to patch 4 (Xiao)
- Remove word "IP" from checksum headers (Xiao)
- Swap to using ifndef CONFIG_32BIT instead of ifdef CONFIG_64BIT (Xiao)
- Run no alignment code when buff is aligned (Xiao)
- Consolidate the overlap between the two do_csum implementations into
  do_csum_common
- Link to v8: https://lore.kernel.org/r/20231027-optimize_checksum-v8-0-feb7101d128d@xxxxxxxxxxxx

Changes in v8:
- Speedups of 12% without Zbb and 21% with Zbb when cpu supports fast
  misaligned accesses for do_csum
- Various formatting updates
- Patch now relies on https://lore.kernel.org/lkml/20230920193801.3035093-1-evan@xxxxxxxxxxxx/
- Link to v7: https://lore.kernel.org/r/20230919-optimize_checksum-v7-0-06c7d0ddd5d6@xxxxxxxxxxxx

Changes in v7:
- Included linux/bitops.h in asm-generic/checksum.h to use ror (Conor)
- Optimized loop in do_csum (David)
- Used ror instead of shifting (David)
- Unfortunately had to reintroduce ifdefs because gcc is not smart
  enough to not throw warnings on code that will never execute
- Use ifdef instead of IS_ENABLED on __LITTLE_ENDIAN because IS_ENABLED
  does not work on that
- Only optimize for zbb when alternatives is enabled in do_csum
- Link to v6: https://lore.kernel.org/r/20230915-optimize_checksum-v6-0-14a6cf61c618@xxxxxxxxxxxx

Changes in v6:
- Fix accuracy of commit message for csum_fold
- Fix indentation
- Link to v5: https://lore.kernel.org/r/20230914-optimize_checksum-v5-0-c95b82a2757e@xxxxxxxxxxxx

Changes in v5:
- Drop vector patches
- Check ZBB enabled before doing any ZBB code (Conor)
- Check endianness in IS_ENABLED
- Revert to the simpler non-tree based version of csum_ipv6_magic since
  David pointed out that the tree based version is not better.
- Link to v4: https://lore.kernel.org/r/20230911-optimize_checksum-v4-0-77cc2ad9e9d7@xxxxxxxxxxxx

Changes in v4:
- Suggestion by David Laight to use an improved checksum used in
  arch/arc.
- Eliminates zero-extension on rv32, but not on rv64.
- Reduces data dependency which should improve execution speed on rv32
  and rv64
- Still passes CHECKSUM_KUNIT and RISCV_CHECKSUM_KUNIT on rv32 and rv64
  with and without zbb.
- Link to v3: https://lore.kernel.org/r/20230907-optimize_checksum-v3-0-c502d34d9d73@xxxxxxxxxxxx

Changes in v3:
- Use riscv_has_extension_likely and has_vector where possible (Conor)
- Reduce ifdefs by using IS_ENABLED where possible (Conor)
- Use kernel_vector_begin in the vector code (Samuel)
- Link to v2: https://lore.kernel.org/r/20230905-optimize_checksum-v2-0-ccd658db743b@xxxxxxxxxxxx

Changes in v2:
- After more benchmarking, rework functions to improve performance.
- Remove tests that overlapped with the already existing checksum
  tests and make tests more extensive.
- Use alternatives to activate code with Zbb and vector extensions
- Link to v1: https://lore.kernel.org/r/20230826-optimize_checksum-v1-0-937501b4522a@xxxxxxxxxxxx

---
Charlie Jenkins (5):
      asm-generic: Improve csum_fold
      riscv: Add static key for misaligned accesses
      riscv: Add checksum header
      riscv: Add checksum library
      kunit: Add tests for csum_ipv6_magic and ip_fast_csum

 arch/riscv/include/asm/checksum.h   |  93 ++++++++++
 arch/riscv/include/asm/cpufeature.h |   2 +
 arch/riscv/kernel/cpufeature.c      |  90 +++++++++-
 arch/riscv/lib/Makefile             |   1 +
 arch/riscv/lib/csum.c               | 326 ++++++++++++++++++++++++++++++++++++
 include/asm-generic/checksum.h      |   6 +-
 lib/checksum_kunit.c                | 284 ++++++++++++++++++++++++++++++-
 7 files changed, 795 insertions(+), 7 deletions(-)
---
base-commit: a39b6ac3781d46ba18193c9dbb2110f31e9bffe9
change-id: 20230804-optimize_checksum-db145288ac21
--
- Charlie