On Sun, Oct 25, 2020 at 11:23:52PM +0000, David Laight wrote:
> From: Arvind Sankar
> > Sent: 25 October 2020 20:18
> >
> > On Sun, Oct 25, 2020 at 06:51:18PM +0000, David Laight wrote:
> > > From: Arvind Sankar
> > > > Sent: 25 October 2020 14:31
> > > >
> > > > Unrolling the LOAD and BLEND loops improves performance by ~8% on x86_64
> > > > (tested on Broadwell Xeon) while not increasing code size too much.
> > >
> > > I can't believe unrolling the BLEND loop makes any difference.
> >
> > It's actually the BLEND loop that accounts for almost all of the
> > difference. The LOAD loop doesn't matter much in general: even replacing
> > it with a plain memcpy() only increases performance by 3-4%. But
> > unrolling it is low cost in code size terms, and clang actually does it
> > without being asked.
>
> (memcpy is wrong - misses the byte swaps).

I know it's wrong; the point is that it's impossible to gain very much
from optimizing the LOAD loop because it doesn't account for much of the
total time.

>
> That's odd, the BLEND loop is about 20 instructions.
> I wouldn't expect unrolling to help - unless you manage
> to use 16 registers for the active W[] values.
>

I am not sure what's going on inside the hardware, but even with a
straightforward assembly version that just reads out of memory the way
the calculation is specified, unrolling the BLEND loop 8x improves
performance by 7-8%.

The compiler is actually pretty bad here: just translating everything
into assembler with no attempt to optimize anything gets a 10-12%
speedup over the C version.
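
For reference, here is a minimal userspace sketch of the two loops under
discussion, in rolled form and unrolled 8x. It is not the kernel code: the
names (load_op, blend_op, schedule_rolled, schedule_unrolled, self_test)
are made up for illustration, and it assumes the standard SHA-256 message
schedule from FIPS 180-4. It says nothing about performance; it only shows
the shape of the transformation and checks that both forms fill W[]
identically.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static inline uint32_t ror32(uint32_t x, unsigned int n)
{
	return (x >> n) | (x << (32 - n));
}

/* SHA-256 "small sigma" functions, as specified in FIPS 180-4. */
static inline uint32_t s0(uint32_t x)
{
	return ror32(x, 7) ^ ror32(x, 18) ^ (x >> 3);
}

static inline uint32_t s1(uint32_t x)
{
	return ror32(x, 17) ^ ror32(x, 19) ^ (x >> 10);
}

/*
 * LOAD step: read word i of the input as big-endian. This is the byte
 * swap that a plain memcpy() would miss on a little-endian machine.
 */
static inline void load_op(int i, uint32_t *W, const uint8_t *in)
{
	W[i] = ((uint32_t)in[4 * i] << 24) | ((uint32_t)in[4 * i + 1] << 16) |
	       ((uint32_t)in[4 * i + 2] << 8) | (uint32_t)in[4 * i + 3];
}

/* BLEND step: expand the schedule from the previous 16 words. */
static inline void blend_op(int i, uint32_t *W)
{
	W[i] = s1(W[i - 2]) + W[i - 7] + s0(W[i - 15]) + W[i - 16];
}

/* Rolled form: one operation per loop iteration. */
static void schedule_rolled(uint32_t W[64], const uint8_t in[64])
{
	int i;

	for (i = 0; i < 16; i++)
		load_op(i, W, in);
	for (i = 16; i < 64; i++)
		blend_op(i, W);
}

/* Hypothetical 8x unrolled form: eight operations per loop iteration. */
static void schedule_unrolled(uint32_t W[64], const uint8_t in[64])
{
	int i;

	for (i = 0; i < 16; i += 8) {
		load_op(i + 0, W, in); load_op(i + 1, W, in);
		load_op(i + 2, W, in); load_op(i + 3, W, in);
		load_op(i + 4, W, in); load_op(i + 5, W, in);
		load_op(i + 6, W, in); load_op(i + 7, W, in);
	}
	for (i = 16; i < 64; i += 8) {
		blend_op(i + 0, W); blend_op(i + 1, W);
		blend_op(i + 2, W); blend_op(i + 3, W);
		blend_op(i + 4, W); blend_op(i + 5, W);
		blend_op(i + 6, W); blend_op(i + 7, W);
	}
}

/* Both forms must produce the same schedule for the same input block. */
static void self_test(void)
{
	uint8_t in[64];
	uint32_t a[64], b[64];
	int i;

	for (i = 0; i < 64; i++)
		in[i] = (uint8_t)(i * 7 + 1);	/* arbitrary test pattern */
	schedule_rolled(a, in);
	schedule_unrolled(b, in);
	assert(memcmp(a, b, sizeof(a)) == 0);
}
```

Whether the unrolled form is actually faster depends on the compiler and
the CPU, so it has to be measured on the target; the sketch only
establishes that the two forms are equivalent.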