On Wed, 16 Oct 2024 at 05:03, Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
>
> On Tue, Oct 15, 2024 at 12:41:40PM +0200, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <ardb@xxxxxxxxxx>
> >
> > Now that kernel mode NEON no longer disables preemption, using FP/SIMD
> > in library code which is not obviously part of the crypto subsystem is
> > no longer problematic, as it will no longer incur unexpected latencies.
> >
> > So accelerate the CRC-32 library code on arm64 to use a 4-way
> > interleave, using PMULL instructions to implement the folding.
> >
> > On Apple M2, this results in a speedup of 2 - 2.8x when using input
> > sizes of 1k - 8k. For smaller sizes, the overhead of preserving and
> > restoring the FP/SIMD register file may not be worth it, so 1k is used
> > as a threshold for choosing this code path.
> >
> > The coefficient tables were generated using code provided by Eric. [0]
> >
> > [0] https://github.com/ebiggers/libdeflate/blob/master/scripts/gen_crc32_multipliers.c
> >
> > Cc: Eric Biggers <ebiggers@xxxxxxxxxx>
> > Signed-off-by: Ard Biesheuvel <ardb@xxxxxxxxxx>
> > ---
> >  arch/arm64/lib/Makefile      |   2 +-
> >  arch/arm64/lib/crc32-glue.c  |  36 +++
> >  arch/arm64/lib/crc32-pmull.S | 240 ++++++++++++++++++++
> >  3 files changed, 277 insertions(+), 1 deletion(-)
>
> Thanks for doing this!  The new code looks good to me.  4-way does seem like the
> right choice for arm64.
>

Agreed.

> I'd recommend calling the file crc32-4way.S and the functions
> crc32*_arm64_4way(), rather than crc32-pmull.S and crc32*_pmull().  This would
> avoid confusion with a CRC implementation that is actually based entirely on
> pmull (which is possible).

I'm well aware :-)

commit 8fefde90e90c9f5c2770e46ceb127813d3f20c34
Author: Ard Biesheuvel <ardb@xxxxxxxxxx>
Date:   Mon Dec 5 18:42:27 2016 +0000

    crypto: arm64/crc32 - accelerated support based on x86 SSE implementation

commit 598b7d41e544322c8c4f3737ee8ddf905a44175e
Author: Ard Biesheuvel <ardb@xxxxxxxxxx>
Date:   Mon Aug 27 13:02:45 2018 +0200

    crypto: arm64/crc32 - remove PMULL based CRC32 driver

I removed it because it wasn't actually faster, although that might be
different on modern cores.

> The proposed implementation uses the crc32
> instructions to do most of the work and only uses pmull for combining the CRCs.
> Yes, crc32c-pcl-intel-asm_64.S made this same mistake, but it is a mistake, IMO.
>

Yeah good point.
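
For readers following along, the "compute four CRCs independently, then
combine them" idea discussed above can be sketched in a few lines of Python.
This is only an illustration of the algebra, not the patch's code: the names
gf2_mul(), x_pow(), crc32_combine() and crc32_4way() are made up for this
sketch, it uses zlib's CRC-32 conventions (reflected polynomial 0xEDB88320,
inverted initial and final value) rather than the kernel's crc32_le()
seeding, and it processes the four chunks one after the other instead of
interleaving them. In the real code the carryless multiply would be done by
PMULL with precomputed constants rather than by the bit loop shown here.

```python
import zlib

POLY = 0xEDB88320  # CRC-32 polynomial, reflected bit order (bit 31 holds x^0)


def gf2_mul(a, b):
    """Multiply a and b as polynomials over GF(2), reduced modulo POLY."""
    p = 0
    for i in range(32):
        if a & (0x80000000 >> i):  # coefficient of x^i in a
            p ^= b                 # add b * x^i
        # b <- b * x (mod POLY); in reflected order that is a right shift
        b = (b >> 1) ^ POLY if b & 1 else b >> 1
    return p


def x_pow(n):
    """Return x^n mod POLY using square-and-multiply."""
    result = 0x80000000  # the polynomial 1
    base = 0x40000000    # the polynomial x
    while n:
        if n & 1:
            result = gf2_mul(result, base)
        base = gf2_mul(base, base)
        n >>= 1
    return result


def crc32_combine(crc_a, crc_b, len_b):
    """CRC of A||B, given crc(A), crc(B) and len(B) in bytes."""
    return gf2_mul(crc_a, x_pow(8 * len_b)) ^ crc_b


def crc32_4way(data):
    """Split data into four chunks, CRC each on its own, then combine."""
    q = len(data) // 4
    chunks = [data[:q], data[q:2 * q], data[2 * q:3 * q], data[3 * q:]]
    crc = 0  # CRC of the empty message
    for chunk in chunks:
        crc = crc32_combine(crc, zlib.crc32(chunk), len(chunk))
    return crc
```

With fixed chunk lengths, the x_pow() results are constants, which is the
role a precomputed coefficient table plays: only the multiply itself has to
happen at run time.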