On Tue, Dec 15, 2020 at 03:47:03PM -0800, Eric Biggers wrote:
> This patchset adds a NEON implementation of BLAKE2b for 32-bit ARM.
> Patches 1-4 prepare for it by making some updates to the generic
> implementation, while patch 5 adds the actual NEON implementation.
>
> On Cortex-A7 (which these days is the most common ARM processor that
> doesn't have the ARMv8 Crypto Extensions), this is over twice as fast as
> SHA-256, and slightly faster than SHA-1.  It is also almost three times
> as fast as the generic implementation of BLAKE2b:
>
>     Algorithm            Cycles per byte (on 4096-byte messages)
>     ===================  =======================================
>     blake2b-256-neon     14.1
>     sha1-neon            16.4
>     sha1-asm             20.8
>     blake2s-256-generic  26.1
>     sha256-neon          28.9
>     sha256-asm           32.1
>     blake2b-256-generic  39.9
>
> This implementation isn't directly based on any other implementation,
> but it borrows some ideas from previous NEON code I've written as well
> as from chacha-neon-core.S.  At least on Cortex-A7, it is faster than
> the other NEON implementations of BLAKE2b I'm aware of (the
> implementation in the BLAKE2 official repository using intrinsics, and
> Andrew Moon's implementation which can be found in SUPERCOP).
>
> NEON-optimized BLAKE2b is useful because there is interest in using
> BLAKE2b-256 for dm-verity on low-end Android devices (specifically,
> devices that lack the ARMv8 Crypto Extensions) to replace SHA-1.  On
> these devices, the performance cost of upgrading to SHA-256 may be
> unacceptable, whereas BLAKE2b-256 would actually improve performance.
>
> Although BLAKE2b is intended for 64-bit platforms (unlike BLAKE2s which
> is intended for 32-bit platforms), on 32-bit ARM processors with NEON,
> BLAKE2b is actually faster than BLAKE2s.  This is because NEON supports
> 64-bit operations, and because BLAKE2s's block size is too small for
> NEON to be helpful for it.  The best I've been able to do with BLAKE2s
> on Cortex-A7 is 19.0 cpb with an optimized scalar implementation.

By the way, if people are interested in having my ARM scalar
implementation of BLAKE2s in the kernel too, I can send a patchset for
that too.  It just ended up being slower than BLAKE2b and SHA-1, so it
wasn't as good for the use case mentioned above.  If it were to be added
as "blake2s-256-arm", we'd have:

    Algorithm            Cycles per byte (on 4096-byte messages)
    ===================  =======================================
    blake2b-256-neon     14.1
    sha1-neon            16.4
    blake2s-256-arm      19.0
    sha1-asm             20.8
    blake2s-256-generic  26.1
    sha256-neon          28.9
    sha256-asm           32.1
    blake2b-256-generic  39.9
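
For anyone who wants to sanity-check the relative speeds implied by the table above, here is a rough sketch that turns the cycles-per-byte figures into speedup ratios.  The numbers are copied from the table; the dictionary and the speedup() helper are names made up for illustration and are not part of the patchset.

```python
# Cycles per byte on 4096-byte messages (Cortex-A7), copied from the table above.
CPB = {
    "blake2b-256-neon": 14.1,
    "sha1-neon": 16.4,
    "blake2s-256-arm": 19.0,
    "sha1-asm": 20.8,
    "blake2s-256-generic": 26.1,
    "sha256-neon": 28.9,
    "sha256-asm": 32.1,
    "blake2b-256-generic": 39.9,
}


def speedup(fast, slow):
    """Return how many times faster `fast` is than `slow`.

    Fewer cycles per byte means faster, so the ratio is slow / fast.
    """
    return CPB[slow] / CPB[fast]


if __name__ == "__main__":
    # List every algorithm from fastest to slowest, with the factor by
    # which blake2b-256-neon outperforms it.
    for name in sorted(CPB, key=CPB.get):
        ratio = speedup("blake2b-256-neon", name)
        print(f"{name:<20} {CPB[name]:5.1f} cpb  {ratio:4.2f}x")
```

The ratios line up with the claims in the cover letter: blake2b-256-neon comes out at a bit over 2x sha256-neon and a bit under 3x blake2b-256-generic.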