On Mon, Oct 28, 2024 at 08:02:14PM +0100, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@xxxxxxxxxx>
>
> The CRC-T10DIF algorithm produces a 16-bit CRC, and this is reflected in
> the folding coefficients, which are also only 16 bits wide.
>
> This means that the polynomial multiplications involving these
> coefficients can be performed using 8-bit long polynomial multiplication
> (8x8 -> 16) in only a few steps, and this is an instruction that is part
> of the base NEON ISA, which is all most real ARMv7 cores implement. (The
> 64-bit PMULL instruction is part of the crypto extensions, which are
> only implemented by 64-bit cores)
>
> The final reduction is a bit more involved, but we can delegate that to
> the generic CRC-T10DIF implementation after folding the entire input
> into a 16 byte vector.
>
> This results in a speedup of around 6.6x on Cortex-A72 running in 32-bit
> mode.
>
> Signed-off-by: Ard Biesheuvel <ardb@xxxxxxxxxx>
> ---
>  arch/arm/crypto/crct10dif-ce-core.S | 50 ++++++++++++++++++--
>  arch/arm/crypto/crct10dif-ce-glue.c | 44 +++++++++++++++--
>  2 files changed, 85 insertions(+), 9 deletions(-)
>
> diff --git a/arch/arm/crypto/crct10dif-ce-core.S b/arch/arm/crypto/crct10dif-ce-core.S
> index 6b72167574b2..5e103a9a42dd 100644
> --- a/arch/arm/crypto/crct10dif-ce-core.S
> +++ b/arch/arm/crypto/crct10dif-ce-core.S
> @@ -112,6 +112,34 @@
>  	FOLD_CONST_L	.req	q10l
>  	FOLD_CONST_H	.req	q10h
>
> +__pmull16x64_p8:
> +	vmull.p8	q13, d23, d24
> +	vmull.p8	q14, d23, d25
> +	vmull.p8	q15, d22, d24
> +	vmull.p8	q12, d22, d25
> +
> +	veor		q14, q14, q15
> +	veor		d24, d24, d25
> +	veor		d26, d26, d27
> +	veor		d28, d28, d29
> +	vmov.i32	d25, #0
> +	vmov.i32	d29, #0
> +	vext.8		q12, q12, q12, #14
> +	vext.8		q14, q14, q14, #15
> +	veor		d24, d24, d26
> +	bx		lr
> +ENDPROC(__pmull16x64_p8)

As in the arm64 version, a few comments here would help.

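For readers following along, here is a rough C model of the idea from the
commit message: a wider carry-less product can be assembled from 8x8 -> 16
partial products. It is only a sketch. The helper names (pmul8,
pmul16_from_p8, pmul16_ref, pmul_selftest) are made up for illustration, it
handles a 16x16 product rather than the 16x64 case, and it does not try to
mirror the register layout or the vext/veor sequence of the quoted routine.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Carry-less (GF(2)) multiply of two 8-bit polynomials, 8x8 -> 16 bits.
 * This models what a single lane of vmull.p8 computes.
 */
static uint16_t pmul8(uint8_t a, uint8_t b)
{
	uint16_t r = 0;

	for (int i = 0; i < 8; i++)
		if (b & (1u << i))
			r ^= (uint16_t)((uint16_t)a << i);
	return r;
}

/*
 * 16x16 -> 32 carry-less multiply built from four 8x8 partial products.
 * The partial products are shifted into place by whole bytes and
 * combined with XOR, since addition in GF(2) has no carries.
 */
static uint32_t pmul16_from_p8(uint16_t a, uint16_t b)
{
	uint8_t al = a & 0xff, ah = a >> 8;
	uint8_t bl = b & 0xff, bh = b >> 8;

	uint32_t lo  = pmul8(al, bl);
	uint32_t mid = (uint32_t)(pmul8(al, bh) ^ pmul8(ah, bl));
	uint32_t hi  = pmul8(ah, bh);

	return lo ^ (mid << 8) ^ (hi << 16);
}

/* Bit-by-bit reference used only to cross-check the composition. */
static uint32_t pmul16_ref(uint16_t a, uint16_t b)
{
	uint32_t r = 0;

	for (int i = 0; i < 16; i++)
		if (b & (1u << i))
			r ^= (uint32_t)a << i;
	return r;
}

/* Hypothetical self-check; the operand values are arbitrary. */
void pmul_selftest(void)
{
	assert(pmul16_from_p8(0x8bb7, 0x1234) == pmul16_ref(0x8bb7, 0x1234));
}
```

The byte-aligned shifts in pmul16_from_p8() are the part that byte-granular
vector shuffles can express cheaply, which is presumably why only a few
steps are needed once the partial products are available.
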
> diff --git a/arch/arm/crypto/crct10dif-ce-glue.c b/arch/arm/crypto/crct10dif-ce-glue.c
> index 60aa79c2fcdb..4431e4ce2dbe 100644
> --- a/arch/arm/crypto/crct10dif-ce-glue.c
> +++ b/arch/arm/crypto/crct10dif-ce-glue.c
> @@ -20,6 +20,7 @@
>  #define CRC_T10DIF_PMULL_CHUNK_SIZE	16U
>
>  asmlinkage u16 crc_t10dif_pmull64(u16 init_crc, const u8 *buf, size_t len);
> +asmlinkage void crc_t10dif_pmull8(u16 init_crc, const u8 *buf, size_t len, u8 *out);

Maybe explicitly type 'out' to 'u8 out[16]'?

- Eric
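
To illustrate the 'u8 out[16]' suggestion: in C an array parameter is
adjusted to a pointer, so the two spellings declare the same function type
and the size serves as documentation for the caller. The sketch below is
userspace C with stand-in typedefs; fold_stub and fold_ptr are hypothetical
names, and the body is a placeholder rather than the real folding routine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint8_t u8;	/* stand-ins for the kernel types */
typedef uint16_t u16;

/*
 * Hypothetical stand-in for the asm routine. The 'u8 out[16]' parameter
 * documents that exactly 16 bytes are written; the compiler still
 * treats it as 'u8 *out'.
 */
static void fold_stub(u16 init_crc, const u8 *buf, size_t len, u8 out[16])
{
	assert(out != NULL);
	(void)buf;
	(void)len;

	/* Placeholder body: clear the vector and store the seed. */
	memset(out, 0, 16);
	out[0] = init_crc & 0xff;
	out[1] = init_crc >> 8;
}

/*
 * The pointer-typed prototype is compatible with the array-typed
 * definition, so this initialization needs no cast.
 */
static void (*const fold_ptr)(u16, const u8 *, size_t, u8 *) = fold_stub;
```

A caller would then declare 'u8 buf16[16];' and pass it directly, which
makes the expected size visible at both ends of the call.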