On Tue, Apr 09, 2024 at 06:52:02PM +0200, Stefan Kanthak wrote:
> "Eric Biggers" <ebiggers@xxxxxxxxxx> wrote:
>
> > +.macro do_4rounds i, m0, m1, m2, m3
> > +.if \i < 16
> > +        movdqu  \i*4(DATA_PTR), MSG
> > +        pshufb  SHUF_MASK, MSG
> > +        movdqa  MSG, \m0
> > +.else
> > +        movdqa  \m0, MSG
> > +.endif
> > +        paddd   \i*4(SHA256CONSTANTS), MSG
>
> To load the round constant independent from and parallel to the previous
> instructions which use \m0 I recommend to change the first lines of the
> do_4rounds macro as follows (this might save 1+ cycle per macro invocation,
> and most obviously 2 lines):
>
> .macro do_4rounds i, m0, m1, m2, m3
> .if \i < 16
>         movdqu  \i*4(DATA_PTR), \m0
>         pshufb  SHUF_MASK, \m0
> .endif
>         movdqa  \i*4(SHA256CONSTANTS), MSG
>         paddd   \m0, MSG
>         ...

Yes, your suggestion looks good.  I don't see any performance difference on
Ice Lake, but it does shorten the source code.  It belongs in a separate patch
though, since this patch isn't meant to change the output.

- Eric