"Eric Biggers" <ebiggers@xxxxxxxxxx> wrote:

> On Tue, Apr 09, 2024 at 06:52:02PM +0200, Stefan Kanthak wrote:
>> "Eric Biggers" <ebiggers@xxxxxxxxxx> wrote:
>>
>> > +.macro do_4rounds	i, m0, m1, m2, m3
>> > +.if \i < 16
>> > +	movdqu		\i*4(DATA_PTR), MSG
>> > +	pshufb		SHUF_MASK, MSG
>> > +	movdqa		MSG, \m0
>> > +.else
>> > +	movdqa		\m0, MSG
>> > +.endif
>> > +	paddd		\i*4(SHA256CONSTANTS), MSG
>>
>> To load the round constant independently of, and in parallel with, the
>> previous instructions which use \m0, I recommend changing the first lines
>> of the do_4rounds macro as follows (this might save 1+ cycle per macro
>> invocation and, most obviously, 2 lines):
>>
>> .macro do_4rounds	i, m0, m1, m2, m3
>> .if \i < 16
>>	movdqu		\i*4(DATA_PTR), \m0
>>	pshufb		SHUF_MASK, \m0
>> .endif
>>	movdqa		\i*4(SHA256CONSTANTS), MSG
>>	paddd		\m0, MSG
>> ...
>
> Yes, your suggestion looks good. I don't see any performance difference on
> Ice Lake, but it does shorten the source code. It belongs in a separate
> patch though, since this patch isn't meant to change the output.

Hmmm... the output was already changed: 2 palignr/pblendw and 16 pshufd have
been replaced with punpck?qdq, and 17 displacements changed.

Next simplification, and 5 more lines gone: replace the macro do_16rounds
with a repetition

@@ ...
-.macro do_16rounds	i
-	do_4rounds	(\i + 0), MSGTMP0, MSGTMP1, MSGTMP2, MSGTMP3
-	do_4rounds	(\i + 4), MSGTMP1, MSGTMP2, MSGTMP3, MSGTMP0
-	do_4rounds	(\i + 8), MSGTMP2, MSGTMP3, MSGTMP0, MSGTMP1
-	do_4rounds	(\i + 12), MSGTMP3, MSGTMP0, MSGTMP1, MSGTMP2
-.endm
-
@@ ...
-	do_16rounds	0
-	do_16rounds	16
-	do_16rounds	32
-	do_16rounds	48
+.irp i, 0, 16, 32, 48
+	do_4rounds	(\i + 0), MSGTMP0, MSGTMP1, MSGTMP2, MSGTMP3
+	do_4rounds	(\i + 4), MSGTMP1, MSGTMP2, MSGTMP3, MSGTMP0
+	do_4rounds	(\i + 8), MSGTMP2, MSGTMP3, MSGTMP0, MSGTMP1
+	do_4rounds	(\i + 12), MSGTMP3, MSGTMP0, MSGTMP1, MSGTMP2
+.endr

This doesn't change the instructions generated, so it belongs in this patch.
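The claim that the .irp form expands to exactly the same do_4rounds calls as
the old do_16rounds macro can be sanity-checked with a short simulation (a
Python sketch of the expansion at the macro-call level; the helper names are
mine, not from the patch):

```python
# Record the (round index, register rotation) arguments that each form
# passes to do_4rounds; identical argument lists mean identical expansion.

REGS = ["MSGTMP0", "MSGTMP1", "MSGTMP2", "MSGTMP3"]

def do_16rounds(i):
    """Old form: the body of the do_16rounds macro for base index i."""
    return [(i + 4 * r, tuple(REGS[r:] + REGS[:r])) for r in range(4)]

# Old form: four explicit do_16rounds invocations.
old = [call for i in (0, 16, 32, 48) for call in do_16rounds(i)]

# New form: .irp i, 0, 16, 32, 48 wrapped around the same four
# do_4rounds lines, expanded inline.
new = [(i + 4 * r, tuple(REGS[r:] + REGS[:r]))
       for i in (0, 16, 32, 48) for r in range(4)]

assert old == new                                  # same expansion
assert [c[0] for c in old] == list(range(0, 64, 4))  # rounds 0..63 covered
```

Both lists contain the same 16 do_4rounds invocations with the same circular
rotation of the four message-schedule registers, so the generated instructions
are unchanged.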
The following suggestion changes instructions: AFAIK all processors which
support the SHA extensions also support AVX.

@@ ...
+.ifnotdef AVX
	movdqa		STATE0, MSGTMP4
	punpcklqdq	STATE1, STATE0		/* FEBA */
	punpckhqdq	MSGTMP4, STATE1		/* DCHG */
	pshufd		$0x1B, STATE0, STATE0	/* ABEF */
	pshufd		$0xB1, STATE1, STATE1	/* CDGH */
+.else
+	vpunpckhqdq	STATE0, STATE1, MSGTMP0	/* DCHG */
+	vpunpcklqdq	STATE0, STATE1, MSGTMP1	/* BAFE */
+	pshufd		$0xB1, MSGTMP0, STATE0	/* CDGH */
+	pshufd		$0xB1, MSGTMP1, STATE1	/* ABEF */
+.endif
@@ ...
+.ifnotdef AVX
	movdqa		\i*4(SHA256CONSTANTS), MSG
	paddd		\m0, MSG
+.else
+	vpaddd		\i*4(SHA256CONSTANTS), \m0, MSG
+.endif
@@ ...
+.ifnotdef AVX
	movdqa		\m0, MSGTMP4
	palignr		$4, \m3, MSGTMP4
+.else
+	vpalignr	$4, \m3, \m0, MSGTMP4
+.endif
@@ ...
+.ifnotdef AVX
	movdqa		STATE1, MSGTMP4
	punpcklqdq	STATE0, STATE1		/* EFGH */
	punpckhqdq	MSGTMP4, STATE0		/* CDAB */
	pshufd		$0x1B, STATE0, STATE0	/* DCBA */
	pshufd		$0xB1, STATE1, STATE1	/* HGFE */
+.else
+	vpunpckhqdq	STATE0, STATE1, MSGTMP0	/* ABCD */
+	vpunpcklqdq	STATE0, STATE1, MSGTMP1	/* EFGH */
+	pshufd		$0x1B, MSGTMP0, STATE0	/* DCBA */
+	pshufd		$0x1B, MSGTMP1, STATE1	/* HGFE */
+.endif

And last: are the "#define ... %xmm?" really necessary?

- MSG can't be anything but %xmm0;
- MSGTMP4 is, despite its MSG prefix, also used to shuffle STATE0 and STATE1,
  so it should be named TMP instead (if kept);
- MSGTMP0 to MSGTMP3 are the circular message schedule, so they should be
  named MSG0 to MSG3 instead (if kept).

I suggest removing at least those which are now encapsulated in the macro and
the repetition.

regards
Stefan
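PS: the lane bookkeeping in the shuffle sequences above (the FEBA/DCHG/ABEF/
CDGH comments) can be double-checked with a short simulation of pshufd and
punpck?qdq on symbolic lanes (a Python sketch; the register model and helper
names are mine, not from the patch):

```python
# Model an xmm register as 4 dwords, LOW lane first, so the high-to-low
# asm comment "DCBA" corresponds to the reversed list [A, B, C, D].

def pshufd(imm, src):
    """pshufd $imm, src, dest: dest lane k = src lane ((imm >> 2k) & 3)."""
    return [src[(imm >> (2 * k)) & 3] for k in range(4)]

def punpcklqdq(src1, src2):
    """Intel operand order: dest = src1 low qword : src2 low qword."""
    return src1[0:2] + src2[0:2]

def punpckhqdq(src1, src2):
    """Intel operand order: dest = src1 high qword : src2 high qword."""
    return src1[2:4] + src2[2:4]

def hi2lo(s):
    """Turn a high-to-low comment like 'DCBA' into low-lane-first lanes."""
    return list(reversed(s))

# SSE entry sequence. AT&T "punpcklqdq STATE1, STATE0" is, in Intel order,
# dest = STATE0, src1 = STATE0, src2 = STATE1.
state0, state1 = hi2lo("DCBA"), hi2lo("HGFE")
tmp = state0                               # movdqa  STATE0, MSGTMP4
state0 = punpcklqdq(state0, state1)        # FEBA
state1 = punpckhqdq(state1, tmp)           # DCHG
state0 = pshufd(0x1B, state0)              # ABEF
state1 = pshufd(0xB1, state1)              # CDGH
assert state0 == hi2lo("ABEF") and state1 == hi2lo("CDGH")

# AVX entry sequence. AT&T "vpunpckhqdq STATE0, STATE1, MSGTMP0" is, in
# Intel order, dest = MSGTMP0, src1 = STATE1, src2 = STATE0.
s0, s1 = hi2lo("DCBA"), hi2lo("HGFE")
m0 = punpckhqdq(s1, s0)                    # DCHG
m1 = punpcklqdq(s1, s0)                    # BAFE
s0 = pshufd(0xB1, m0)                      # CDGH
s1 = pshufd(0xB1, m1)                      # ABEF
assert s0 == hi2lo("CDGH") and s1 == hi2lo("ABEF")
```

As the two final assertions make visible, the AVX variant ends with ABEF and
CDGH in swapped registers relative to the SSE sequence, exactly as the
comments in the diff above say.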