On Thu, Dec 17, 2020 at 4:54 AM Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
>
> On Wed, Dec 16, 2020 at 11:32:44PM +0100, Jason A. Donenfeld wrote:
> > Hi Eric,
> >
> > On Wed, Dec 16, 2020 at 9:48 PM Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
> > > By the way, if people are interested in having my ARM scalar implementation of
> > > BLAKE2s in the kernel too, I can send a patchset for that too. It just ended up
> > > being slower than BLAKE2b and SHA-1, so it wasn't as good for the use case
> > > mentioned above. If it were to be added as "blake2s-256-arm", we'd have:
> >
> > I'd certainly be interested in this. Any rough idea how it performs
> > for pretty small messages compared to the generic implementation?
> > 100-140 byte ranges? Is the speedup about the same as for longer
> > messages because this doesn't parallelize across multiple blocks?
> >
>
> It does one block at a time, and there isn't much overhead, so yes the speedup
> on short messages should be about the same as on long messages.
>
> I did a couple quick userspace benchmarks and got (still on Cortex-A7):
>
>     100-byte messages:
>         BLAKE2s ARM:     28.9 cpb
>         BLAKE2s generic: 42.4 cpb
>
>     140-byte messages:
>         BLAKE2s ARM:     29.5 cpb
>         BLAKE2s generic: 44.0 cpb
>
> The results in the kernel may differ a bit, but probably not by much.

That's certainly a nice improvement though, and I'd very much welcome
the faster implementation.

Jason