On Fri, Feb 18, 2022 at 04:34:28PM -0800, Eric Biggers wrote:
> > +.macro schoolbook1_iteration i xor_sum
> > +    .set i, \i
> > +    .set xor_sum, \xor_sum
> > +    movups (16*i)(OP1), %xmm0
> > +    .if(i == 0 && xor_sum == 1)
> > +        pxor SUM, %xmm0
> > +    .endif
> > +    vpclmulqdq $0x01, (16*i)(OP2), %xmm0, %xmm1
> > +    vpxor %xmm1, MI, MI
> > +    vpclmulqdq $0x00, (16*i)(OP2), %xmm0, %xmm2
> > +    vpxor %xmm2, LO, LO
> > +    vpclmulqdq $0x11, (16*i)(OP2), %xmm0, %xmm3
> > +    vpxor %xmm3, HI, HI
> > +    vpclmulqdq $0x10, (16*i)(OP2), %xmm0, %xmm4
> > +    vpxor %xmm4, MI, MI
>
> Perhaps the above multiplications and XORs should be reordered slightly so that
> each XOR doesn't depend on the previous instruction? A good ordering might be:
>
>     vpclmulqdq $0x01, (16*\i)(OP2), %xmm0, %xmm1
>     vpclmulqdq $0x10, (16*\i)(OP2), %xmm0, %xmm2
>     vpclmulqdq $0x00, (16*\i)(OP2), %xmm0, %xmm3
>     vpclmulqdq $0x11, (16*\i)(OP2), %xmm0, %xmm4
>     vpxor %xmm1, MI, MI
>     vpxor %xmm3, LO, LO
>     vpxor %xmm4, HI, HI
>     vpxor %xmm2, MI, MI
>
> With that, no instruction would depend on either of the previous two
> instructions.
>
> This might be more important in the ARM64 version than the x86_64 version, as
> x86_64 CPUs are pretty aggressive about internally reordering instructions. But
> it's something to consider in both versions.
>
> Likewise in schoolbook1_noload.

Or slightly better:

    vpclmulqdq $0x01, (16*\i)(OP2), %xmm0, %xmm2
    vpclmulqdq $0x00, (16*\i)(OP2), %xmm0, %xmm1
    vpclmulqdq $0x10, (16*\i)(OP2), %xmm0, %xmm3
    vpclmulqdq $0x11, (16*\i)(OP2), %xmm0, %xmm4
    vpxor %xmm2, MI, MI
    vpxor %xmm1, LO, LO
    vpxor %xmm4, HI, HI
    vpxor %xmm3, MI, MI

- Eric