On Mon, Oct 28, 2024 at 08:02:10PM +0100, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@xxxxxxxxxx>
>
> The CRC-T10DIF implementation for arm64 has a version that uses 8x8
> polynomial multiplication, for cores that lack the crypto extensions,
> which cover the 64x64 polynomial multiplication instruction that the
> algorithm was built around.
>
> This fallback version rather naively adopted the 64x64 polynomial
> multiplication algorithm that I ported from ARM for the GHASH driver,
> which needs 8 PMULL8 instructions to implement one PMULL64. This is
> reasonable, given that each 8-bit vector element needs to be multiplied
> with each element in the other vector, producing 8 vectors with partial
> results that need to be combined to yield the correct result.
>
> However, most PMULL64 invocations in the CRC-T10DIF code involve
> multiplication by a pair of 16-bit folding coefficients, and so all the
> partial results from higher order bytes will be zero, and there is no
> need to calculate them to begin with.
>
> Then, the CRC-T10DIF algorithm always XORs the output values of the
> PMULL64 instructions being issued in pairs, and so there is no need to
> faithfully implement each individual PMULL64 instruction, as long as
> XORing the results pairwise produces the expected result.
>
> Implementing these improvements results in a speedup of 3.3x on low-end
> platforms such as Raspberry Pi 4 (Cortex-A72)
>
> Signed-off-by: Ard Biesheuvel <ardb@xxxxxxxxxx>
> ---
>  arch/arm64/crypto/crct10dif-ce-core.S | 71 +++++++++++++++-----
>  1 file changed, 54 insertions(+), 17 deletions(-)

Thanks, this makes sense.

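For anyone following along, the two observations in the commit message can be checked with a small Python model of carry-less multiplication. This is only a sketch: the helper names (clmul, byte, clmul_bytewise) and the constants are made up for illustration and are not taken from the patch, and it models the arithmetic only, not the NEON register layout.

```python
def clmul(a, b):
    """Carry-less (GF(2) polynomial) product of two non-negative ints."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def byte(x, i):
    """Byte i of x, counting from the least significant byte."""
    return (x >> (8 * i)) & 0xFF


def clmul_bytewise(coeff, data, coeff_bytes):
    """Carry-less product of coeff and a 64-bit value, built from 8x8 products.

    Only the low `coeff_bytes` bytes of `coeff` are visited.
    """
    result = 0
    for i in range(coeff_bytes):
        for j in range(8):
            result ^= clmul(byte(coeff, i), byte(data, j)) << (8 * (i + j))
    return result


# Made-up 16-bit coefficients and 64-bit inputs (NOT the real folding constants).
k_lo, k_hi = 0x1234, 0xBEEF
x_lo, x_hi = 0x0123456789ABCDEF, 0xFEDCBA9876543210

# Observation 1: bytes 2..7 of a 16-bit coefficient are zero, so 16 of the
# 64 byte-by-byte products already give the full result.
full = clmul_bytewise(k_lo, x_lo, 8)
short = clmul_bytewise(k_lo, x_lo, 2)
assert full == short == clmul(k_lo, x_lo)

# Observation 2: only the XOR of the two products is needed, so the partial
# products of the pair can be combined before they are shifted into place.
fused = 0
for i in range(2):
    for j in range(8):
        partial = (clmul(byte(k_lo, i), byte(x_lo, j))
                   ^ clmul(byte(k_hi, i), byte(x_hi, j)))
        fused ^= partial << (8 * (i + j))
assert fused == clmul(k_lo, x_lo) ^ clmul(k_hi, x_hi)
```

Both assertions hold because carry-less multiplication distributes over XOR, which is what lets the partial results be dropped or merged early.
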
> +SYM_FUNC_START_LOCAL(__pmull_p8_16x64)
> +	ext		t6.16b, t5.16b, t5.16b, #8
> +
> +	pmull		t3.8h, t7.8b, t5.8b
> +	pmull		t4.8h, t7.8b, t6.8b
> +	pmull2		t5.8h, t7.16b, t5.16b
> +	pmull2		t6.8h, t7.16b, t6.16b
> +
> +	ext		t8.16b, t3.16b, t3.16b, #8
> +	eor		t4.16b, t4.16b, t6.16b
> +	ext		t7.16b, t5.16b, t5.16b, #8
> +	ext		t6.16b, t4.16b, t4.16b, #8
> +	eor		t8.8b, t8.8b, t3.8b
> +	eor		t5.8b, t5.8b, t7.8b
> +	eor		t4.8b, t4.8b, t6.8b
> +	ext		t5.16b, t5.16b, t5.16b, #14
> +	ret
> +SYM_FUNC_END(__pmull_p8_16x64)

A few comments in the above function would be really helpful.

- Eric
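
As a reading aid for the function above, here is a rough Python model of the three vector instructions it is built from. It reflects the architectural definitions of PMULL, PMULL2 and EXT as I understand them; the Python function names are hypothetical, registers are modelled as 16-byte sequences with index 0 as the least significant byte, and nothing here says what t3..t8 actually hold at each step.

```python
def clmul8(a, b):
    """Carry-less product of two 8-bit values (result fits in 15 bits)."""
    result = 0
    for bit in range(8):
        if (b >> bit) & 1:
            result ^= a << bit
    return result


def pmull(vn, vm):
    """PMULL Vd.8H, Vn.8B, Vm.8B: lane-wise products of the low 8 bytes."""
    return [clmul8(vn[i], vm[i]) for i in range(8)]


def pmull2(vn, vm):
    """PMULL2 Vd.8H, Vn.16B, Vm.16B: lane-wise products of the high 8 bytes."""
    return [clmul8(vn[i], vm[i]) for i in range(8, 16)]


def ext(vn, vm, imm):
    """EXT Vd.16B, Vn.16B, Vm.16B, #imm: 16 bytes starting at byte imm of Vn:Vm."""
    return (bytes(vn) + bytes(vm))[imm:imm + 16]


v = bytes(range(16))
swapped = ext(v, v, 8)    # same register twice: the two 64-bit halves swap
rotated = ext(v, v, 14)   # old byte 0 ends up at index 2 (a 16-bit rotate)
```

With the same register passed as both sources, EXT is a byte rotation, which is how the `#8` and `#14` forms in the quoted function can be read.
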