On Tue, Oct 20, 2020 at 09:36:00PM +0000, David Laight wrote:
> From: Arvind Sankar
> > Sent: 20 October 2020 21:40
> >
> > Putting the round constants and the message schedule arrays together in
> > one structure saves one register, which can be a significant benefit on
> > register-constrained architectures. On x86-32 (tested on Broadwell
> > Xeon), this gives a 10% performance benefit.
>
> I'm actually stunned it makes that much difference.
> The object code must be truly horrid (before and after).
>
> There are probably other strange tweaks that give a similar
> improvement.
>
>     David
>

Hm yes, I took a closer look at the generated code, and gcc seems to be
doing something completely braindead. Before this change, it actually
copies 8 words at a time from SHA256_K onto the stack, and uses those
stack temporaries for the calculation. So this patch is giving a benefit
just because it only does the copy once instead of every time around the
loop.

It doesn't even really need a register to hold SHA256_K since this isn't
PIC code, it could just access it directly as SHA256_K(%ecx) if it just
multiplied the loop counter i by 4.
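
For illustration only, here is a rough C sketch of the two points above. It
is not the patch itself. The names struct sha256_sketch, init_sketch,
sum_separate, sum_combined and k_at_byte_offset are hypothetical, the table
is truncated to the first 8 of the 64 round constants, and the loops are a
stand-in for the real round function. It shows (a) two separate arrays
needing two base addresses versus one structure needing a single base, and
(b) that SHA256_K[i] is the word at byte offset i * 4 from the symbol, which
is what an absolute SHA256_K(%ecx) operand would read in non-PIC code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: only the first 8 of the 64 round constants, for brevity. */
#define NK 8

static const uint32_t SHA256_K[NK] = {
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5,
    0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
};

/* Hypothetical combined layout: constants and schedule share one base. */
struct sha256_sketch {
    uint32_t k[NK]; /* copy of the round constants */
    uint32_t w[NK]; /* message schedule words */
};

/* Fill the structure once, up front, instead of once per loop iteration. */
static void init_sketch(struct sha256_sketch *s, const uint32_t *w)
{
    memcpy(s->k, SHA256_K, sizeof s->k);
    memcpy(s->w, w, sizeof s->w);
}

/* Two arrays: the loop keeps two base pointers live. */
static uint32_t sum_separate(const uint32_t *k, const uint32_t *w)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < NK; i++)
        acc += k[i] + w[i]; /* stand-in for the real round computation */
    return acc;
}

/* One structure: both arrays are fixed offsets from a single base. */
static uint32_t sum_combined(const struct sha256_sketch *s)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < NK; i++)
        acc += s->k[i] + s->w[i];
    return acc;
}

/* Read a constant by byte offset from the symbol, i.e. offset = i * 4. */
static uint32_t k_at_byte_offset(size_t byte_off)
{
    uint32_t v;
    assert(byte_off % sizeof(uint32_t) == 0 && byte_off < sizeof SHA256_K);
    memcpy(&v, (const unsigned char *)SHA256_K + byte_off, sizeof v);
    return v;
}
```

Whether the compiler actually saves a register with the combined layout
depends on the target and on gcc's choices, so treat this as a picture of the
addressing, not as a performance claim.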