From: Eric Biggers
> Sent: 22 October 2020 05:35
>
> On Tue, Oct 20, 2020 at 04:39:57PM -0400, Arvind Sankar wrote:
> > Putting the round constants and the message schedule arrays together in
> > one structure saves one register, which can be a significant benefit on
> > register-constrained architectures. On x86-32 (tested on Broadwell
> > Xeon), this gives a 10% performance benefit.
> >
> > Signed-off-by: Arvind Sankar <nivedita@xxxxxxxxxxxx>
> > Suggested-by: David Laight <David.Laight@xxxxxxxxxx>
> > ---
> >  lib/crypto/sha256.c | 49 ++++++++++++++++++++++++++-------------------
> >  1 file changed, 28 insertions(+), 21 deletions(-)
> >
> > diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
> > index 3a8802d5f747..985cd0560d79 100644
> > --- a/lib/crypto/sha256.c
> > +++ b/lib/crypto/sha256.c
> > @@ -29,6 +29,11 @@ static const u32 SHA256_K[] = {
> >  	0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
> >  };
> >
> > +struct KW {
> > +	u32 K[64];
> > +	u32 W[64];
> > +};
>
> Note that this doubles the stack usage from 256 to 512 bytes. That's pretty
> large for kernel code, especially when compiler options can increase the stack
> usage well beyond the "expected" value.
>
> So unless this gives a big performance improvement on architectures other than
> 32-bit x86 (which people don't really care about these days), we probably
> shouldn't do this.

IIRC the gain came from an odd side effect - which can probably
be got (for some compiler versions) by other means.

> FWIW, it's possible to reduce the length of 'W' to 16 words by computing the
> next W value just before each round 16-63,

I was looking at that.
You'd need to do the first 16 rounds then rounds 16-63 in a second
loop to avoid the conditional.
The problem is that it needs too many registers.
You'd need registers for 16 W values, the 8 a-h and a few spare.

...
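As a rough illustration of the 16-word 'W' idea (a userspace sketch, not
code from the patch; the names schedule_full, schedule_next and
schedules_match are made up here), the schedule can be kept in a 16-entry
ring and checked against the full 64-word expansion. It only covers the
message schedule and says nothing about register pressure in the rounds.

```c
#include <assert.h>
#include <stdint.h>

static inline uint32_t ror32(uint32_t x, unsigned int n)
{
	return (x >> n) | (x << (32 - n));
}

/* SHA-256 message-schedule sigma functions (FIPS 180-4). */
static inline uint32_t s0(uint32_t x) { return ror32(x, 7) ^ ror32(x, 18) ^ (x >> 3); }
static inline uint32_t s1(uint32_t x) { return ror32(x, 17) ^ ror32(x, 19) ^ (x >> 10); }

/* Reference: expand the 16 input words into the full 64-word schedule. */
static void schedule_full(const uint32_t in[16], uint32_t W[64])
{
	int t;

	for (t = 0; t < 16; t++)
		W[t] = in[t];
	for (t = 16; t < 64; t++)
		W[t] = s1(W[t - 2]) + W[t - 7] + s0(W[t - 15]) + W[t - 16];
}

/*
 * Rolling variant: w[] holds only the most recent 16 words.  W[t] overwrites
 * the slot that held W[t - 16], which is not needed again afterwards.
 */
static uint32_t schedule_next(uint32_t w[16], int t)
{
	assert(t >= 16 && t < 64);
	w[t & 15] += s1(w[(t - 2) & 15]) + w[(t - 7) & 15] + s0(w[(t - 15) & 15]);
	return w[t & 15];
}

/* Returns 0 if both variants agree on W[16..63] for the given block. */
int schedules_match(const uint32_t in[16])
{
	uint32_t W[64], w[16];
	int t;

	schedule_full(in, W);
	for (t = 0; t < 16; t++)
		w[t] = in[t];
	for (t = 16; t < 64; t++)
		if (schedule_next(w, t) != W[t])
			return -1;
	return 0;
}
```

In a compression function the first 16 rounds would read w[t] directly and
a second loop over t = 16..63 would call schedule_next(w, t) just before
each round, which keeps the conditional out of the loop body.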
Looking closely, each round is like:
	t1 = h + e1(e) + Ch(e, f, g) + 0x428a2f98 + W[0];
	t2 = e0(a) + Maj(a, b, c);
	h = t1 + t2;   // Not used for a few rounds
	d += t1;       // Needed next round
So t1 only needs 4 of the state variables (e, f, g, h).
The next round's t1 uses d, e, f and g.
So with extreme care d and h can use the same register.
Although I'm not sure how you'd get the compiler to do it.
Possibly making state[] volatile (or using READ/WRITE_ONCE).
So the last two lines become:
	state[7] = t1 + t2;
	d = state[3] + t1;
That might stop the x86 (32bit) spilling registers.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
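
A userspace sketch of the round form described above, for illustration
only. It assumes a volatile state[] array whose slots change role each
round, so the new 'h' is stored straight to memory and 'd' is read back
from it. The names round_ref, round_rot and rounds_match are hypothetical,
and kw stands for the already-summed K[i] + W[i]. It only checks that this
form computes the same values as the textbook eight-variable shuffle;
whether it stops any compiler from spilling is untested here.

```c
#include <assert.h>
#include <stdint.h>

static inline uint32_t ror32(uint32_t x, unsigned int n)
{
	return (x >> n) | (x << (32 - n));
}

/* SHA-256 round functions (FIPS 180-4); names follow lib/crypto/sha256.c. */
static inline uint32_t e0(uint32_t x) { return ror32(x, 2) ^ ror32(x, 13) ^ ror32(x, 22); }
static inline uint32_t e1(uint32_t x) { return ror32(x, 6) ^ ror32(x, 11) ^ ror32(x, 25); }
static inline uint32_t Ch(uint32_t x, uint32_t y, uint32_t z) { return z ^ (x & (y ^ z)); }
static inline uint32_t Maj(uint32_t x, uint32_t y, uint32_t z) { return (x & y) | (z & (x | y)); }

/* Reference round: s[0..7] = a..h, all eight values shuffled every round. */
static void round_ref(uint32_t s[8], uint32_t kw)
{
	uint32_t t1 = s[7] + e1(s[4]) + Ch(s[4], s[5], s[6]) + kw;
	uint32_t t2 = e0(s[0]) + Maj(s[0], s[1], s[2]);

	s[7] = s[6]; s[6] = s[5]; s[5] = s[4]; s[4] = s[3] + t1;
	s[3] = s[2]; s[2] = s[1]; s[1] = s[0]; s[0] = t1 + t2;
}

/*
 * Rotating form: in round i the value playing role n (0 = a ... 7 = h)
 * lives in st[(n - i) & 7], so only the 'h' and 'd' slots are written.
 */
static void round_rot(volatile uint32_t st[8], unsigned int i, uint32_t kw)
{
#define ST(n) st[((n) - i) & 7u]
	uint32_t t1, t2;

	assert(i < 64);
	t1 = ST(7) + e1(ST(4)) + Ch(ST(4), ST(5), ST(6)) + kw;
	t2 = e0(ST(0)) + Maj(ST(0), ST(1), ST(2));
	ST(7) = t1 + t2;	/* "state[7] = t1 + t2" */
	ST(3) = ST(3) + t1;	/* "d = state[3] + t1" */
#undef ST
}

/* Returns 0 if 64 rounds of both forms leave identical state. */
int rounds_match(const uint32_t init[8], const uint32_t kw[64])
{
	uint32_t s[8];
	volatile uint32_t st[8];
	unsigned int i, n;

	for (n = 0; n < 8; n++) {
		s[n] = init[n];
		st[n] = init[n];
	}
	for (i = 0; i < 64; i++) {
		round_ref(s, kw[i]);
		round_rot(st, i, kw[i]);
	}
	/* 64 is a multiple of 8, so the roles are back in their home slots. */
	for (n = 0; n < 8; n++)
		if (s[n] != st[n])
			return -1;
	return 0;
}
```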