On Tue, Oct 15, 2024 at 12:41:40PM +0200, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@xxxxxxxxxx>
>
> Now that kernel mode NEON no longer disables preemption, using FP/SIMD
> in library code which is not obviously part of the crypto subsystem is
> no longer problematic, as it will no longer incur unexpected latencies.
>
> So accelerate the CRC-32 library code on arm64 to use a 4-way
> interleave, using PMULL instructions to implement the folding.
>
> On Apple M2, this results in a speedup of 2 - 2.8x when using input
> sizes of 1k - 8k. For smaller sizes, the overhead of preserving and
> restoring the FP/SIMD register file may not be worth it, so 1k is used
> as a threshold for choosing this code path.
>
> The coefficient tables were generated using code provided by Eric. [0]
>
> [0] https://github.com/ebiggers/libdeflate/blob/master/scripts/gen_crc32_multipliers.c
>
> Cc: Eric Biggers <ebiggers@xxxxxxxxxx>
> Signed-off-by: Ard Biesheuvel <ardb@xxxxxxxxxx>
> ---
>  arch/arm64/lib/Makefile      |   2 +-
>  arch/arm64/lib/crc32-glue.c  |  36 +++
>  arch/arm64/lib/crc32-pmull.S | 240 ++++++++++++++++++++
>  3 files changed, 277 insertions(+), 1 deletion(-)

Thanks for doing this!  The new code looks good to me.  4-way does seem
like the right choice for arm64.

I'd recommend calling the file crc32-4way.S and the functions
crc32*_arm64_4way(), rather than crc32-pmull.S and crc32*_pmull().  This
would avoid confusion with a CRC implementation that is actually based
entirely on pmull (which is possible).  The proposed implementation uses
the crc32 instructions to do most of the work and only uses pmull for
combining the CRCs.  Yes, crc32c-pcl-intel-asm_64.S made this same
mistake, but it is a mistake, IMO.

- Eric