RE: [PATCH crypto 1/2] lib/crypto: blake2s-generic: reduce code size on small systems

David Laight <David.Laight@xxxxxxxxxx> · Wed, 12 Jan 2022 21:27:40 +0000

From: Jason A. Donenfeld
> Sent: 12 January 2022 18:51
> 
> On Wed, Jan 12, 2022 at 7:32 PM Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
> > How about unrolling the inner loop but not the outer one?  Wouldn't that give
> > most of the benefit, without hurting performance as much?
> >
> > If you stay with this approach and don't unroll either loop, can you use 'r' and
> > 'i' instead of 'i' and 'j', to match the naming in G()?
> 
> All this might work, sure. But as mentioned earlier, I've abandoned
> this entirely, as I don't think this patch is necessary. See the v3
> patchset instead:
> 
> https://lore.kernel.org/linux-crypto/20220111220506.742067-1-Jason@xxxxxxxxx/

I think you mentioned in another thread that the buffers (e.g. for IPv6
addresses) are actually often quite short.

For short buffers the 'rolled-up' loop may perform about as well as
the unrolled one, because of the time taken to read all the instructions
into the I-cache and decode them.
If the loop ends up small enough it will fit into the 'decoded loop
buffer' of modern Intel x86 CPUs and won't even need decoding on
each iteration.
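
To make the rolled/unrolled distinction concrete, here is a rough sketch
of one round written both ways. It is not the blake2s-generic code: the
MIX() step only imitates the shape of G() and leaves out the message
words and the sigma permutation, and the names round_rolled(),
round_unrolled() and mix_idx are made up for illustration. The rolled
form has one copy of the mixing step in the instruction stream, the
unrolled form has eight.

```c
#include <stdint.h>

static inline uint32_t ror32(uint32_t x, unsigned int n)
{
	return (x >> n) | (x << (32 - n));
}

/* Mixing step shaped like BLAKE2s G(), but without message words. */
#define MIX(a, b, c, d) do {				\
	a += b; d = ror32(d ^ a, 16);			\
	c += d; b = ror32(b ^ c, 12);			\
	a += b; d = ror32(d ^ a, 8);			\
	c += d; b = ror32(b ^ c, 7);			\
} while (0)

/* Column and diagonal word indices, looked up at run time. */
static const uint8_t mix_idx[8][4] = {
	{ 0, 4,  8, 12 }, { 1, 5,  9, 13 }, { 2, 6, 10, 14 }, { 3, 7, 11, 15 },
	{ 0, 5, 10, 15 }, { 1, 6, 11, 12 }, { 2, 7,  8, 13 }, { 3, 4,  9, 14 },
};

/* Rolled: small code, indices loaded from the table on each iteration. */
static void round_rolled(uint32_t v[16])
{
	for (int i = 0; i < 8; i++)
		MIX(v[mix_idx[i][0]], v[mix_idx[i][1]],
		    v[mix_idx[i][2]], v[mix_idx[i][3]]);
}

/* Unrolled: eight copies of the step, indices are compile-time constants. */
static void round_unrolled(uint32_t v[16])
{
	MIX(v[0], v[4], v[8],  v[12]);
	MIX(v[1], v[5], v[9],  v[13]);
	MIX(v[2], v[6], v[10], v[14]);
	MIX(v[3], v[7], v[11], v[15]);
	MIX(v[0], v[5], v[10], v[15]);
	MIX(v[1], v[6], v[11], v[12]);
	MIX(v[2], v[7], v[8],  v[13]);
	MIX(v[3], v[4], v[9],  v[14]);
}
```

Both functions compute the same result; which one is faster for a
single short input depends on whether the code is already in the
I-cache, which is the point above.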

I suspect that the heavily unrolled loop is only really fast
for big buffers and/or when it is already in the I-cache.
I wonder how often that actually happens in real life,
especially for the uses the kernel is making of the code.

You need to benchmark single executions of the function
(doable on x86 with the performance monitor cycle counter)
to get typical/best clocks/byte figures rather than a
big average for repeated operation on a long buffer.
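
Something along these lines is what I mean by timing single executions.
It is only a userspace sketch with assumptions: toy_hash() is a
hypothetical stand-in for the function under test (it is FNV-1a, not
BLAKE2s), and clock_gettime() is used so that it builds anywhere,
whereas on x86 you would read the cycle counter instead to get clocks.
Each call is timed on its own and the minimum and median are kept,
instead of dividing one long run by the byte count. It does not evict
the code from the I-cache between calls, so the first sample is the
only one that is anything like 'cold'.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for the function under test (FNV-1a). */
static uint32_t toy_hash(const uint8_t *buf, size_t len)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < len; i++) {
		h ^= buf[i];
		h *= 16777619u;
	}
	return h;
}

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/*
 * Time n separate calls, one sample per call.
 * On return out[0] is the minimum and out[1] the median, in ns.
 * samples[] is sorted in place.
 */
static void time_single_calls(const uint8_t *buf, size_t len,
			      uint64_t *samples, size_t n, uint64_t out[2])
{
	volatile uint32_t sink = 0;	/* stops the call being optimised away */

	assert(n > 0);
	for (size_t i = 0; i < n; i++) {
		uint64_t t0 = now_ns();

		sink = toy_hash(buf, len);
		samples[i] = now_ns() - t0;
	}
	(void)sink;

	qsort(samples, n, sizeof(*samples), cmp_u64);
	out[0] = samples[0];
	out[1] = samples[n / 2];
}
```

Call it with a 16-byte buffer and a few hundred samples, and compare
the minimum and median against the per-byte figure a long-buffer
benchmark reports. The timer overhead is included in every sample, so
for a function this short the absolute numbers are only a rough guide.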

	David
