From: Jason A. Donenfeld > Sent: 06 May 2022 23:34 > > Hi David, > > On Thu, May 05, 2022 at 11:34:40AM +0000, David Laight wrote: > > OTOH the entropy mixing is very likely to be 'cold cache' > > and all the unrolling in blakes7 will completely kill > > performance. > > I've seen you mention the BLAKE2s unrolling in like 8 different threads > now, and I'm not convinced that you're entirely wrong, nor am I > convinced that you're entirely right. My response to you is the same as > always: please send a patch with some measurements! I'd love to get this > worked out in a real way. > > The last time I went benching these, the unrolled code was ~100 cycles > faster, if I recall correctly, than the rolled code, when used from > WireGuard's hot path. I don't doubt that a cold path would be more > fraught, though, as that's a decent amount of code. So the question is > how to re-roll the rounds without sacrificing those 100 cycles. > > In order to begin to figure that out, we have to look at why the > re-rolled loop is slow and the unrolled loop fast. It's not because of > complicated pipeline things. It's because the BLAKE2s permutation is > actually 10 different permutations, one for each round. Take a look at > the core function, G, and its uses of the round number, r: > > #define G(r, i, a, b, c, d) do { \ > a += b + m[blake2s_sigma[r][2 * i + 0]]; \ > d = ror32(d ^ a, 16); \ > c += d; \ > b = ror32(b ^ c, 12); \ > a += b + m[blake2s_sigma[r][2 * i + 1]]; \ > d = ror32(d ^ a, 8); \ > c += d; \ > b = ror32(b ^ c, 7); \ > } while (0) Each of those lines is a couple of instructions and they are all dependant on the preceding value. I count 14 - excluding the m[] accesses. There are 80 copies of G() - total 1120, or 17.5/byte. To get any faster than that you need to get the compiler to interleave the generated code for multiple expansions of G(). > The blake2s_sigma array is a `static const u8 blake2s_sigma[10][16]`, > with a row for every one of the 10 rounds. What this is actually doing > is reading the message words in a different order each round, so that > the whole permutation is different. > > When the loop is unrolled, blake2s_sigma gets inlined, and then there > are no memory accesses. When it's re-rolled, every round accesses > blake2s_sigma 16 times, which hinders performance. It shouldn't really make much difference. There are only two memory reads for each 14 arithmetic ops. So unless you manage to maintain 4 instructions/clock there are spare clocks for the extra memory cycles. Any that is assuming one read/clock, modern x86 can do 2 (with a following wind!) On x86 the array index is free. > You'll notice, on the other hand, that the SIMD hand coded assembly > implementations do not unroll. The trick is to hide the cost of the > blake2s_sigma indirection in the data dependencies, so that performance > isn't affected. Naively re-rolling the generic code does not inspire the > compiler to do that. But maybe you can figure something out? I've not looked at that version. I have written AVX/SSE code - hard work finding the asm mnumonics! > Anyway, that's about where my thinking is on this, but I'd love to see > some patches from you at some point if you're interested. Ok I just ran some tests looping over the ROUND() without updating v[]. These are using rdpmc to time single calls - not averaging over a large number of iterations. On my i7-7700 the unrolled loop is about 6.2 clocks/byte. The 'cold cache' for a single 64 byte block is about 20 clocks/byte. If I use: for (u32 i = 0; i < 10; i++) ROUND(i); it drops to 7.8 clocks/byte but the single block is about 15. Part of the problem is extra register spills to stack. So we play some games: Remove the blake2s_sigma[] out of G() into ROUND() so the 'normal' code uses ROUND(blake2s_sigma[0]) (etc). Then we can use the loop: for (const u8 *bs = &blake2s_sigma[0][0]; bs < end; bs += 16) ROUND(bs); (with some bits to make it compile.) Annoyingly that makes the compiler spill all the bs[] to stack. A few carefully placed barrier() would help. But simpler is 'volatile const u8 *bs'. This has the desired effect of interleaving the 'sigma' reads with the buffer ones. This gave me 7.2 clocks/byte, single block maybe 14. So about 1 clock/byte (maybe 15%) slower. OTOH in many cases this isn't critical. The cold cache cutoff for the unrolled loop is around 256 bytes. The other big effect the unrolled loop has for small buffers is that it displaces a lot of code from the i-cache. That will make the unrolled code worse for many real situations. I suspect that the compiler ends up spilling another register (or 2) to the stack. Hand assembly could avoid that. I've not looked at what gets spilled, gcc can make horrid choices. It might be worth making (say) a and c volatile (by splitting v[]) and using a temporary in G(), eg: c = t = c + d; b = ror32(b ^ t, 12); David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)