Re: [PATCH 3/6] crypto: arm64/crct10dif - Remove remaining 64x64 PMULL fallback code

Eric Biggers <ebiggers@xxxxxxxxxx> · Tue, 29 Oct 2024 21:15:57 -0700

On Mon, Oct 28, 2024 at 08:02:11PM +0100, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@xxxxxxxxxx>
> 
> The only remaining user of the fallback implementation of 64x64
> polynomial multiplication using 8x8 PMULL instructions is the final
> reduction from a 16 byte vector to a 16-bit CRC.
> 
> The fallback code is complicated and messy, and this reduction has very
> little impact on the overall performance, so instead, let's calculate
> the final CRC by passing the 16 byte vector to the generic CRC-T10DIF
> implementation.
> 
> Signed-off-by: Ard Biesheuvel <ardb@xxxxxxxxxx>
> ---
>  arch/arm64/crypto/crct10dif-ce-core.S | 237 +++++---------------
>  arch/arm64/crypto/crct10dif-ce-glue.c |  15 +-
>  2 files changed, 64 insertions(+), 188 deletions(-)

For CRCs of short messages, doing a fast reduction from 128 bits can be quite
important.  But I agree that when only a 8x8 => 16 carryless multiplication is
available, it can't really be optimized, and just falling back to the generic
implementation is the right approach in that case.

> diff --git a/arch/arm64/crypto/crct10dif-ce-core.S b/arch/arm64/crypto/crct10dif-ce-core.S
> index 8d99ccf61f16..1db5d1d1e2b7 100644
[...]
>  	ad		.req	v14
> -
> -	k00_16		.req	v15
> -	k32_48		.req	v16
> +	bd		.req	v15
>  
>  	t3		.req	v17
>  	t4		.req	v18
> @@ -91,117 +89,7 @@
>  	t8		.req	v22
>  	t9		.req	v23

ad, bd, and t9 are no longer used.

> +	// Use Barrett reduction to compute the final CRC value.
> +	pmull2		v1.1q, v1.2d, fold_consts.2d	// high 32 bits * floor(x^48 / G(x))

v0.2d was accidentally replaced with v1.2d above, which is causing a self-test
failure in crct10dif-arm64-ce.

Otherwise this patch looks good; thanks!

- Eric