On Sun, Oct 25, 2020 at 06:51:18PM +0000, David Laight wrote:
> From: Arvind Sankar
> > Sent: 25 October 2020 14:31
> >
> > Unrolling the LOAD and BLEND loops improves performance by ~8% on x86_64
> > (tested on Broadwell Xeon) while not increasing code size too much.
>
> I can't believe unrolling the BLEND loop makes any difference.

It's actually the BLEND loop that accounts for almost all of the
difference. The LOAD loop doesn't matter much in general: even replacing
it with a plain memcpy() only increases performance by 3-4%. But
unrolling it is low cost in code size terms, and clang actually does it
without being asked.

> WRT patch 5.
> On the C2758 the original unrolled code is slightly faster.
> On the i7-7700 the 8 unroll is a bit faster 'hot cache',
> but slower 'cold cache' - probably because of the d-cache
> loads for K[].
>
> Non-x86 architectures might need to use d-cache reads for
> the 32bit 'K' constants even in the unrolled loop.
> X86 can use 'lea' with a 32bit offset to avoid data reads.
> So the cold-cache case for the old code may be similar.

Not sure I follow: in the old code, the K's are 32-bit immediates, so
they should come from the i-cache whether an add or an lea is used?

Why is the cold-cache case relevant anyway? If the code is only being
executed a couple of times or so, i.e. you're hashing a single say
64-128 byte input once in a blue moon, the performance of the hash
doesn't really matter, no?

>
> Interestingly I had to write an asm ror32() to get reasonable
> code (in userspace). The C version the kernel uses didn't work.
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>
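
For anyone following along without the patch open, the two loops being
discussed can be sketched roughly as below. This is a userspace
illustration under stated assumptions, not the kernel code: the names
load_op, blend_op, schedule_rolled and schedule_unrolled are made up for
the example, the unroll factor of 8 is assumed from the "8 unroll"
mentioned above, and ror32() here is a plain C rotate that a compiler may
or may not turn into a single rotate instruction.

```c
#include <stdint.h>

/* Rotate right; n must be in 1..31, which holds for every call below. */
static inline uint32_t ror32(uint32_t x, unsigned int n)
{
	return (x >> n) | (x << (32 - n));
}

/* SHA-256 message schedule sigma functions (FIPS 180-4). */
static inline uint32_t s0(uint32_t x)
{
	return ror32(x, 7) ^ ror32(x, 18) ^ (x >> 3);
}

static inline uint32_t s1(uint32_t x)
{
	return ror32(x, 17) ^ ror32(x, 19) ^ (x >> 10);
}

/* LOAD: read big-endian word i from the 64-byte input block. */
static inline void load_op(int i, uint32_t *W, const uint8_t *in)
{
	in += 4 * i;
	W[i] = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
	       ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}

/* BLEND: extend the schedule from the 16 previous words. */
static inline void blend_op(int i, uint32_t *W)
{
	W[i] = s1(W[i - 2]) + W[i - 7] + s0(W[i - 15]) + W[i - 16];
}

/* One operation per iteration. */
static void schedule_rolled(uint32_t W[64], const uint8_t in[64])
{
	int i;

	for (i = 0; i < 16; i++)
		load_op(i, W, in);
	for (i = 16; i < 64; i++)
		blend_op(i, W);
}

/* Eight operations per iteration (assumed unroll factor). */
static void schedule_unrolled(uint32_t W[64], const uint8_t in[64])
{
	int i;

	for (i = 0; i < 16; i += 8) {
		load_op(i + 0, W, in); load_op(i + 1, W, in);
		load_op(i + 2, W, in); load_op(i + 3, W, in);
		load_op(i + 4, W, in); load_op(i + 5, W, in);
		load_op(i + 6, W, in); load_op(i + 7, W, in);
	}
	for (i = 16; i < 64; i += 8) {
		blend_op(i + 0, W); blend_op(i + 1, W);
		blend_op(i + 2, W); blend_op(i + 3, W);
		blend_op(i + 4, W); blend_op(i + 5, W);
		blend_op(i + 6, W); blend_op(i + 7, W);
	}
}
```

Both functions fill W[] identically; only the loop structure differs.
Note that each BLEND step depends on W[i - 2], so unrolling does not
remove the data dependency, it only cuts loop overhead and gives the
compiler constant offsets within each group of eight.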