On Tue, Nov 05, 2024 at 05:09:00PM +0100, Ard Biesheuvel wrote: > From: Ard Biesheuvel <ardb@xxxxxxxxxx> > > I realized that the generic sequence implementing 64x64 polynomial > multiply using 8x8 PMULL instructions, which is used in the CRC-T10DIF > code to implement a fallback version for cores that lack the 64x64 PMULL > instruction, is not very efficient. > > The folding coefficients that are used when processing the bulk of the > data are only 16 bits wide, and so 3/4 of the partial results of all those > 8x8->16 bit multiplications do not contribute anything to the end result. > > This means we can use a much faster implementation, producing a speedup > of 3.3x on Cortex-A72 without Crypto Extensions (Raspberry Pi 4). > > The same logic can be ported to 32-bit ARM too, where it produces a > speedup of 6.6x compared with the generic C implementation on the same > platform. > > Changes since v1: > - fix bug introduced in refactoring > - add asm comments to explain the fallback algorithm > - type 'u8 *out' parameter as 'u8 out[16]' > - avoid asm code for 16 byte inputs (a higher threshold might be more > appropriate but 16 is nonsensical given that the folding routine > returns a 16 byte output) > > Ard Biesheuvel (6): > crypto: arm64/crct10dif - Remove obsolete chunking logic > crypto: arm64/crct10dif - Use faster 16x64 bit polynomial multiply > crypto: arm64/crct10dif - Remove remaining 64x64 PMULL fallback code > crypto: arm/crct10dif - Use existing mov_l macro instead of __adrl > crypto: arm/crct10dif - Macroify PMULL asm code > crypto: arm/crct10dif - Implement plain NEON variant > > arch/arm/crypto/crct10dif-ce-core.S | 249 ++++++++++----- > arch/arm/crypto/crct10dif-ce-glue.c | 55 +++- > arch/arm64/crypto/crct10dif-ce-core.S | 335 +++++++++----------- > arch/arm64/crypto/crct10dif-ce-glue.c | 48 ++- > 4 files changed, 376 insertions(+), 311 deletions(-) Reviewed-by: Eric Biggers <ebiggers@xxxxxxxxxx> - Eric