"Eric Biggers" <ebiggers@xxxxxxxxxx> wrote:
> +.macro do_4rounds i, m0, m1, m2, m3
> +.if \i < 16
> +	movdqu \i*4(DATA_PTR), MSG
> +	pshufb SHUF_MASK, MSG
> +	movdqa MSG, \m0
> +.else
> +	movdqa \m0, MSG
> +.endif
> +	paddd \i*4(SHA256CONSTANTS), MSG

To load the round constants independently of, and in parallel with, the preceding instructions that use \m0, I recommend changing the first lines of the do_4rounds macro as follows. This drops the register copy between MSG and \m0, so for \i >= 16 only the paddd has to wait for \m0 (this might save a cycle or more per macro invocation, and, most obviously, two lines):

.macro do_4rounds i, m0, m1, m2, m3
.if \i < 16
	movdqu \i*4(DATA_PTR), \m0
	pshufb SHUF_MASK, \m0
.endif
	movdqa \i*4(SHA256CONSTANTS), MSG
	paddd \m0, MSG
...

regards
Stefan
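Because paddd is a lane-wise addition modulo 2^32, and therefore commutative, computing \m0 + K instead of K + \m0 yields the same MSG, and in both versions \m0 ends up holding the byte-swapped message words. The C sketch below models the two sequences to make that check concrete. It treats each XMM register as four uint32_t lanes and assumes SHUF_MASK byte-swaps each dword (so the big-endian SHA-256 message words land in native order); the helper names patch_seq, suggested_seq and same_result are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One 128-bit XMM register modelled as four 32-bit lanes (an assumption for illustration). */
typedef struct { uint32_t w[4]; } xmm;

/* movdqu + pshufb, assuming SHUF_MASK byte-swaps each dword: decode four big-endian words. */
static xmm load_msg(const uint8_t *p)
{
    xmm r;
    for (int j = 0; j < 4; j++)
        r.w[j] = (uint32_t)p[4 * j] << 24 | (uint32_t)p[4 * j + 1] << 16 |
                 (uint32_t)p[4 * j + 2] << 8 | (uint32_t)p[4 * j + 3];
    return r;
}

/* movdqa \i*4(SHA256CONSTANTS), MSG: the four round constants starting at K[i]. */
static xmm load_k(const uint32_t *k, int i)
{
    xmm r;
    memcpy(r.w, &k[i], sizeof r.w);
    return r;
}

/* paddd src, dst: lane-wise addition modulo 2^32. */
static xmm paddd(xmm dst, xmm src)
{
    for (int j = 0; j < 4; j++)
        dst.w[j] += src.w[j];
    return dst;
}

/* Sequence in the patch: load into MSG, copy to \m0 (or copy \m0 into MSG), add constants. */
static xmm patch_seq(int i, const uint8_t *data, const uint32_t *k, xmm *m0)
{
    assert(i % 4 == 0 && i < 64);      /* four rounds per invocation, 64 rounds total */
    xmm msg;
    if (i < 16) {
        msg = load_msg(data + 4 * i);   /* movdqu + pshufb into MSG */
        *m0 = msg;                      /* movdqa MSG, \m0 */
    } else {
        msg = *m0;                      /* movdqa \m0, MSG */
    }
    return paddd(msg, load_k(k, i));    /* paddd \i*4(SHA256CONSTANTS), MSG */
}

/* Suggested sequence: load into \m0 directly, fetch constants into MSG, then add \m0. */
static xmm suggested_seq(int i, const uint8_t *data, const uint32_t *k, xmm *m0)
{
    assert(i % 4 == 0 && i < 64);
    if (i < 16)
        *m0 = load_msg(data + 4 * i);   /* movdqu + pshufb into \m0 */
    xmm msg = load_k(k, i);             /* movdqa \i*4(SHA256CONSTANTS), MSG */
    return paddd(msg, *m0);             /* paddd \m0, MSG */
}

/* Returns 1 if both sequences leave identical values in MSG and in \m0. */
static int same_result(int i, const uint8_t *data, const uint32_t *k, xmm m0_in)
{
    xmm a = m0_in, b = m0_in;
    xmm msg_a = patch_seq(i, data, k, &a);
    xmm msg_b = suggested_seq(i, data, k, &b);
    return memcmp(&msg_a, &msg_b, sizeof msg_a) == 0 &&
           memcmp(&a, &b, sizeof a) == 0;
}
```

Calling same_result for every i in 0, 4, ..., 60 with arbitrary message bytes, constants and an arbitrary starting \m0 should return 1 each time. The model only checks the values produced; whether the reordering actually saves cycles is something only a benchmark on real hardware can settle.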