From: Arvind Sankar
> Sent: 25 October 2020 20:18
>
> On Sun, Oct 25, 2020 at 06:51:18PM +0000, David Laight wrote:
> > From: Arvind Sankar
> > > Sent: 25 October 2020 14:31
> > >
> > > Unrolling the LOAD and BLEND loops improves performance by ~8% on x86_64
> > > (tested on Broadwell Xeon) while not increasing code size too much.
> >
> > I can't believe unrolling the BLEND loop makes any difference.
>
> It's actually the BLEND loop that accounts for almost all of the
> difference. The LOAD loop doesn't matter much in general: even replacing
> it with a plain memcpy() only increases performance by 3-4%. But
> unrolling it is low cost in code size terms, and clang actually does it
> without being asked.

(memcpy is wrong - it misses the byte swaps.)

That's odd, the BLEND loop is about 20 instructions.
I wouldn't expect unrolling to help - unless you manage to
use 16 registers for the active W[] values.

> > WRT patch 5.
> > On the C2758 the original unrolled code is slightly faster.
> > On the i7-7700 the 8 unroll is a bit faster 'hot cache',
> > but slower 'cold cache' - probably because of the d-cache
> > loads for K[].
> >
> > Non-x86 architectures might need to use d-cache reads for
> > the 32bit 'K' constants even in the unrolled loop.
> > X86 can use 'lea' with a 32bit offset to avoid data reads.
> > So the cold-cache case for the old code may be similar.
>
> Not sure I follow: in the old code, the K's are 32-bit immediates, so
> they should come from the i-cache whether an add or an lea is used?

I was thinking of other instruction sets that end up using
pc-relative addressing for constants.
That might only happen for 64-bit constants, though.

> Why is the cold-cache case relevant anyway? If the code is only being
> executed a couple of times or so, i.e. you're hashing a single say
> 64-128 byte input once in a blue moon, the performance of the hash
> doesn't really matter, no?

I was measuring the cold-cache case because I could.
I didn't note the actual figures, but it was 8-10 times slower
than the hot-cache case.

While sha256 is likely to be run hot-cache (on a big buffer),
the cold-cache timings are probably relevant for things like memcpy().

I remember seeing a very long divide function for sparc32 that was
probably only a gain in a benchmark loop - it would have displaced
a lot of the working set from the i-cache!

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes,
MK1 1PT, UK
Registration No: 1397386 (Wales)
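
[Editor's note: for reference, below is a minimal standalone sketch of the
LOAD and BLEND loops under discussion, written directly from the SHA-256
message-schedule definition rather than copied from the kernel source; the
helper names (rotr32, load_be32, sha256_schedule) are illustrative. The
BLEND loop is shown unrolled by 8, roughly in the spirit of the patch being
discussed.]

#include <stdint.h>

static inline uint32_t rotr32(uint32_t x, unsigned int n)
{
        return (x >> n) | (x << (32 - n));
}

/* Big-endian load: this byte swap is why a plain memcpy() is wrong. */
static inline uint32_t load_be32(const uint8_t *p)
{
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* sigma0/sigma1 from the SHA-256 message schedule. */
#define s0(x) (rotr32((x), 7)  ^ rotr32((x), 18) ^ ((x) >> 3))
#define s1(x) (rotr32((x), 17) ^ rotr32((x), 19) ^ ((x) >> 10))

static void sha256_schedule(uint32_t W[64], const uint8_t input[64])
{
        int i;

        /* LOAD loop: 16 big-endian words of the 64-byte message block. */
        for (i = 0; i < 16; i++)
                W[i] = load_be32(input + 4 * i);

        /*
         * BLEND loop, unrolled by 8:
         * W[i] = s1(W[i-2]) + W[i-7] + s0(W[i-15]) + W[i-16]
         */
        for (i = 16; i < 64; i += 8) {
                W[i]     = s1(W[i - 2]) + W[i - 7] + s0(W[i - 15]) + W[i - 16];
                W[i + 1] = s1(W[i - 1]) + W[i - 6] + s0(W[i - 14]) + W[i - 15];
                W[i + 2] = s1(W[i])     + W[i - 5] + s0(W[i - 13]) + W[i - 14];
                W[i + 3] = s1(W[i + 1]) + W[i - 4] + s0(W[i - 12]) + W[i - 13];
                W[i + 4] = s1(W[i + 2]) + W[i - 3] + s0(W[i - 11]) + W[i - 12];
                W[i + 5] = s1(W[i + 3]) + W[i - 2] + s0(W[i - 10]) + W[i - 11];
                W[i + 6] = s1(W[i + 4]) + W[i - 1] + s0(W[i -  9]) + W[i - 10];
                W[i + 7] = s1(W[i + 5]) + W[i]     + s0(W[i -  8]) + W[i -  9];
        }
}

Whether the unrolled form is a win depends largely on how many of the live
W[] values the compiler can keep in registers across the eight steps, which
is the concern raised in the discussion above.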