Re: [PATCH 0/5] crypto: add NEON-optimized BLAKE2b

Eric Biggers <ebiggers@xxxxxxxxxx> · Wed, 16 Dec 2020 12:47:56 -0800

On Tue, Dec 15, 2020 at 03:47:03PM -0800, Eric Biggers wrote:
> This patchset adds a NEON implementation of BLAKE2b for 32-bit ARM.
> Patches 1-4 prepare for it by making some updates to the generic
> implementation, while patch 5 adds the actual NEON implementation.
> 
> On Cortex-A7 (which these days is the most common ARM processor that
> doesn't have the ARMv8 Crypto Extensions), this is over twice as fast as
> SHA-256, and slightly faster than SHA-1.  It is also almost three times
> as fast as the generic implementation of BLAKE2b:
> 
> 	Algorithm            Cycles per byte (on 4096-byte messages)
> 	===================  =======================================
> 	blake2b-256-neon     14.1
> 	sha1-neon            16.4
> 	sha1-asm             20.8
> 	blake2s-256-generic  26.1
> 	sha256-neon	     28.9
> 	sha256-asm	     32.1
> 	blake2b-256-generic  39.9
> 
> This implementation isn't directly based on any other implementation,
> but it borrows some ideas from previous NEON code I've written as well
> as from chacha-neon-core.S.  At least on Cortex-A7, it is faster than
> the other NEON implementations of BLAKE2b I'm aware of (the
> implementation in the BLAKE2 official repository using intrinsics, and
> Andrew Moon's implementation which can be found in SUPERCOP).
> 
> NEON-optimized BLAKE2b is useful because there is interest in using
> BLAKE2b-256 for dm-verity on low-end Android devices (specifically,
> devices that lack the ARMv8 Crypto Extensions) to replace SHA-1.  On
> these devices, the performance cost of upgrading to SHA-256 may be
> unacceptable, whereas BLAKE2b-256 would actually improve performance.
> 
> Although BLAKE2b is intended for 64-bit platforms (unlike BLAKE2s which
> is intended for 32-bit platforms), on 32-bit ARM processors with NEON,
> BLAKE2b is actually faster than BLAKE2s.  This is because NEON supports
> 64-bit operations, and because BLAKE2s's block size is too small for
> NEON to be helpful for it.  The best I've been able to do with BLAKE2s
> on Cortex-A7 is 19.0 cpb with an optimized scalar implementation.

By the way, if people are interested in having my ARM scalar implementation of
BLAKE2s in the kernel too, I can send a patchset for that too.  It just ended up
being slower than BLAKE2b and SHA-1, so it wasn't as good for the use case
mentioned above.  If it were to be added as "blake2s-256-arm", we'd have:

	Algorithm            Cycles per byte (on 4096-byte messages)
	===================  =======================================
	blake2b-256-neon     14.1
	sha1-neon            16.4
	blake2s-256-arm      19.0
	sha1-asm             20.8
	blake2s-256-generic  26.1
	sha256-neon	     28.9
	sha256-asm	     32.1
	blake2b-256-generic  39.9