Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks

Eric Biggers <ebiggers@xxxxxxxxxx> · Tue, 21 Apr 2020 21:04:15 -0700

On Mon, Apr 20, 2020 at 01:57:11AM -0600, Jason A. Donenfeld wrote:
> The initial Zinc patchset, after some mailing list discussion, contained
> code to ensure that kernel_fpu_enable would not be kept on for more than
> a PAGE_SIZE chunk, since it disables preemption. The choice of PAGE_SIZE
> isn't totally scientific, but it's not a bad guess either, and it's
> what's used in both the x86 poly1305 and blake2s library code already.
> Unfortunately it appears to have been left out of the final patchset
> that actually added the glue code. So, this commit adds back the
> PAGE_SIZE chunking.
> 
> Fixes: 84e03fa39fbe ("crypto: x86/chacha - expose SIMD ChaCha routine as library function")
> Fixes: b3aad5bad26a ("crypto: arm64/chacha - expose arm64 ChaCha routine as library function")
> Fixes: a44a3430d71b ("crypto: arm/chacha - expose ARM ChaCha routine as library function")
> Fixes: f569ca164751 ("crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
> Fixes: a6b803b3ddc7 ("crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
> Cc: Eric Biggers <ebiggers@xxxxxxxxxx>
> Cc: Ard Biesheuvel <ardb@xxxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Jason A. Donenfeld <Jason@xxxxxxxxx>
> ---
> Eric, Ard - I'm wondering if this was in fact just an oversight in Ard's
> patches, or if there was actually some later discussion in which we
> concluded that the PAGE_SIZE chunking wasn't required, perhaps because
> of FPU changes. If that's the case, please do let me know, in which case
> I'll submit a _different_ patch that removes the chunking from x86 poly
> and blake. I can't find any emails that would indicate that, but I might
> be mistaken.
> 
>  arch/arm/crypto/chacha-glue.c        | 16 +++++++++++++---
>  arch/arm/crypto/poly1305-glue.c      | 17 +++++++++++++----
>  arch/arm64/crypto/chacha-neon-glue.c | 16 +++++++++++++---
>  arch/arm64/crypto/poly1305-glue.c    | 17 +++++++++++++----
>  arch/x86/crypto/chacha_glue.c        | 16 +++++++++++++---
>  5 files changed, 65 insertions(+), 17 deletions(-)

I don't think you're missing anything.  On x86, kernel_fpu_begin() and
kernel_fpu_end() did get optimized in v5.2, but they still disable preemption,
which is the concern here: without chunking, one call on a large buffer keeps
preemption disabled for the entire operation.

> 
> diff --git a/arch/arm/crypto/chacha-glue.c b/arch/arm/crypto/chacha-glue.c
> index 6fdb0ac62b3d..0e29ebac95fd 100644
> --- a/arch/arm/crypto/chacha-glue.c
> +++ b/arch/arm/crypto/chacha-glue.c
> @@ -91,9 +91,19 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
>  		return;
>  	}
>  
> -	kernel_neon_begin();
> -	chacha_doneon(state, dst, src, bytes, nrounds);
> -	kernel_neon_end();
> +	for (;;) {
> +		unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);
> +
> +		kernel_neon_begin();
> +		chacha_doneon(state, dst, src, todo, nrounds);
> +		kernel_neon_end();
> +
> +		bytes -= todo;
> +		if (!bytes)
> +			break;
> +		src += todo;
> +		dst += todo;
> +	}
>  }
>  EXPORT_SYMBOL(chacha_crypt_arch);

Seems this should just be a 'while' loop?

	while (bytes) {
		unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);

		kernel_neon_begin();
		chacha_doneon(state, dst, src, todo, nrounds);
		kernel_neon_end();

		bytes -= todo;
		src += todo;
		dst += todo;
	}
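A user-space sketch (not kernel code) comparing the two loop shapes. kernel_neon_begin(), kernel_neon_end() and chacha_doneon() are replaced by hypothetical counting stubs (fake_neon_begin, fake_neon_end, and fake_doneon, which just XORs a constant instead of running ChaCha), and PAGE_SIZE is assumed to be 4096. For any nonzero length both forms open the same number of SIMD sections, ceil(bytes / PAGE_SIZE), each covering at most PAGE_SIZE bytes. They differ only at bytes == 0, where the 'for (;;)' form still enters one empty section and the 'while' form does nothing.

```c
#include <assert.h>
#include <string.h> /* memcmp() for checking the two forms agree */

/* Assumption: 4 KiB pages; the kernel's PAGE_SIZE varies by architecture. */
#define PAGE_SIZE 4096u
#define MIN_UINT(a, b) ((a) < (b) ? (a) : (b))

/* Counters recorded by the hypothetical stubs below. */
static unsigned int sections;  /* begin/end pairs, i.e. non-preemptible regions */
static unsigned int max_chunk; /* largest length processed inside one region */

static void fake_neon_begin(void) { sections++; }
static void fake_neon_end(void) { }

/* Hypothetical stand-in for chacha_doneon(): XORs a constant, not real ChaCha. */
static void fake_doneon(unsigned char *dst, const unsigned char *src,
			unsigned int len)
{
	if (len > max_chunk)
		max_chunk = len;
	for (unsigned int i = 0; i < len; i++)
		dst[i] = src[i] ^ 0x5a;
}

void reset_counters(void)
{
	sections = 0;
	max_chunk = 0;
}

/* Loop shape from the patch. */
void crypt_for_ever(unsigned char *dst, const unsigned char *src,
		    unsigned int bytes)
{
	for (;;) {
		unsigned int todo = MIN_UINT(PAGE_SIZE, bytes);

		fake_neon_begin();
		fake_doneon(dst, src, todo);
		fake_neon_end();

		bytes -= todo;
		if (!bytes)
			break;
		src += todo;
		dst += todo;
	}
}

/* Loop shape suggested in review. */
void crypt_while(unsigned char *dst, const unsigned char *src,
		 unsigned int bytes)
{
	while (bytes) {
		unsigned int todo = MIN_UINT(PAGE_SIZE, bytes);

		fake_neon_begin();
		fake_doneon(dst, src, todo);
		fake_neon_end();

		bytes -= todo;
		src += todo;
		dst += todo;
	}
}
```

For a 10000-byte buffer either form gives three sections of 4096, 4096 and 1808 bytes.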

Likewise elsewhere in this patch.  (For Poly1305, len >= POLY1305_BLOCK_SIZE
when the loop is reached, so the body always runs at least once and a 'do' loop
would work.)
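A similar user-space sketch of the 'do' loop for Poly1305. poly1305_update_sketch() and fake_blocks() are hypothetical; the sketch omits the partial-block buffering and the non-NEON fallback that real glue code needs. It rounds the input down to whole 16-byte blocks, so the loop is only reached with at least one block and the body must run at least once. Because PAGE_SIZE (assumed 4096) is a multiple of POLY1305_BLOCK_SIZE, every chunk handed to the block function is also a whole number of blocks, which the stub asserts.

```c
#include <assert.h>

/* Assumptions: 4 KiB pages; Poly1305 processes 16-byte blocks. */
#define PAGE_SIZE 4096u
#define POLY1305_BLOCK_SIZE 16u

static unsigned int sections;     /* stand-in begin/end pairs */
static unsigned long blocks_done; /* blocks handed to the stub */

/* Hypothetical stand-in for poly1305_blocks_neon(); only counts blocks. */
static void fake_blocks(const unsigned char *src, unsigned int len)
{
	(void)src;
	/* Each chunk must be whole blocks; PAGE_SIZE % 16 == 0 guarantees it. */
	assert(len % POLY1305_BLOCK_SIZE == 0);
	blocks_done += len / POLY1305_BLOCK_SIZE;
}

/*
 * Hypothetical simplification of the update path: hashes all whole blocks
 * in PAGE_SIZE chunks and returns the leftover tail (< one block) that real
 * code would buffer for the next call.
 */
unsigned int poly1305_update_sketch(const unsigned char *src,
				    unsigned int nbytes)
{
	if (nbytes >= POLY1305_BLOCK_SIZE) {
		unsigned int len = nbytes - nbytes % POLY1305_BLOCK_SIZE;

		/* len >= POLY1305_BLOCK_SIZE here, so one pass is always needed. */
		do {
			unsigned int todo = len < PAGE_SIZE ? len : PAGE_SIZE;

			sections++; /* stands in for kernel_neon_begin() */
			fake_blocks(src, todo);
			/* kernel_neon_end() would follow here */

			len -= todo;
			src += todo;
		} while (len);
	}
	return nbytes % POLY1305_BLOCK_SIZE;
}
```

With 9000 bytes this gives three sections (4096, 4096 and 800 bytes, 562 blocks in total) and an 8-byte tail.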

- Eric