On Tue, Aug 21, 2018 at 9:46 AM Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> wrote:
>
> Replace the literal load of the addend vector with a sequence that
> composes it using immediates. While at it, tweak the code that refers
> to it so it does not clobber the register, so we can take the load
> out of the loop as well.
>
> This results in generally better code, but also works around a Clang
> issue, whose integrated assembler does not implement the GNU ARM asm
> syntax completely, and does not support the =literal notation for
> FP registers.

Would you mind linking to the issue tracker for this:
https://bugs.llvm.org/show_bug.cgi?id=38642

And maybe the comment from the binutils source? (or the arm32 reference
manual you mentioned in https://lkml.org/lkml/2018/8/21/589)
https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gas/testsuite/gas/aarch64/programmer-friendly.s;h=6254c6476efdc848648b05068be0574e7addc85d;hb=HEAD#l11

They can help provide more context for future travelers.

>
> Cc: Nick Desaulniers <ndesaulniers@xxxxxxxxxx>
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx>
> ---
>  arch/arm64/crypto/aes-modes.S | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
> index 483a7130cf0e..e966620ee230 100644
> --- a/arch/arm64/crypto/aes-modes.S
> +++ b/arch/arm64/crypto/aes-modes.S
> @@ -225,6 +225,14 @@ AES_ENTRY(aes_ctr_encrypt)
>         enc_prepare     w22, x21, x6
>         ld1             {v4.16b}, [x24]
>
> +       /* compose addend vector { 1, 2, 3, 0 } in v8.4s */
> +       movi            v7.4h, #1
> +       movi            v8.4h, #2
> +       uaddl           v6.4s, v7.4h, v8.4h

Clever; how come you didn't `movi v6.4h, #3` or use `saddl`? Shorter
encoding? Or does it simplify the zips later? Since the destination is
larger, does this give us the 0?

The following zip1/zip2 block is a little tricky. Can you help me
understand it better?

> +       zip1            v8.8h, v7.8h, v8.8h

If zip1 takes the upper halves, wouldn't those be zeros, since the movi
instructions above only wrote small constants to the lower 64 bits?

> +       zip1            v8.4s, v8.4s, v6.4s
> +       zip2            v8.8h, v8.8h, v7.8h

From the docs [0], it looks like zip1/zip2 are used for interleaving two
vectors, IIUC? In our case we're interleaving three vectors (v6, v7, and
v8) into v8? And we don't need a second zip2 because...? Don't we need
(or aren't we more interested in) the bottom half of v6? I've tried to
trace the lane values myself; see the sketch at the bottom of this mail.

Note to self: Page 6 [1] seems like a useful doc on arrangement
specifiers.

To get { 1, 2, 3, 0 }, does ARM have something like iota/lane-id +
swizzle instructions, i.e.:
iota -> { 0, 1, 2, 3 }
swizzle -> { 1, 2, 3, 0 }

[0] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802a/UADDL_advsimd_vector.html
[1] https://www.uio.no/studier/emner/matnat/ifi/INF5063/h17/timeplan/armv8-neon.pdf

> +
>         umov            x6, v4.d[1]             /* keep swabbed ctr in reg */
>         rev             x6, x6
> .LctrloopNx:
> @@ -232,17 +240,16 @@ AES_ENTRY(aes_ctr_encrypt)
>         bmi             .Lctr1x
>         cmn             w6, #4                  /* 32 bit overflow? */
>         bcs             .Lctr1x
> -       ldr             q8, =0x30000000200000001        /* addends 1,2,3[,0] */
>         dup             v7.4s, w6
>         mov             v0.16b, v4.16b
>         add             v7.4s, v7.4s, v8.4s
>         mov             v1.16b, v4.16b
> -       rev32           v8.16b, v7.16b
> +       rev32           v7.16b, v7.16b
>         mov             v2.16b, v4.16b
>         mov             v3.16b, v4.16b
> -       mov             v1.s[3], v8.s[0]
> -       mov             v2.s[3], v8.s[1]
> -       mov             v3.s[3], v8.s[2]
> +       mov             v1.s[3], v7.s[0]
> +       mov             v2.s[3], v7.s[1]
> +       mov             v3.s[3], v7.s[2]
>         ld1             {v5.16b-v7.16b}, [x20], #48     /* get 3 input blocks */
>         bl              aes_encrypt_block4x
>         eor             v0.16b, v5.16b, v0.16b
> @@ -296,7 +303,6 @@ AES_ENTRY(aes_ctr_encrypt)
>         ins             v4.d[0], x7
>         b               .Lctrcarrydone
> AES_ENDPROC(aes_ctr_encrypt)
> -       .ltorg
>
>
> /*
> --
> 2.17.1
>

-- 
Thanks,
~Nick Desaulniers
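
P.S. To check my own understanding, I hacked up a small standalone C
sketch that replays the lane shuffles on plain arrays. This is just my
reading of the movi/uaddl/zip1/zip2 semantics, assuming the
little-endian lane layout; the vreg union and the helper names are made
up for illustration, so please correct me if I've got the semantics
wrong:

/*
 * Standalone userspace sketch (not kernel code): model v6/v7/v8 as
 * arrays and replay the instruction sequence to see what ends up in
 * v8.4s.  Little-endian lane layout assumed.
 */
#include <stdint.h>
#include <stdio.h>

typedef union {
	uint16_t h[8];	/* .8h view of a 128-bit register */
	uint32_t s[4];	/* .4s view */
} vreg;

/* zip1 (hi == 0) / zip2 (hi == 1) with the .8h arrangement */
static vreg zip_8h(vreg a, vreg b, int hi)
{
	int base = hi ? 4 : 0;
	vreg r;

	for (int i = 0; i < 4; i++) {
		r.h[2 * i]     = a.h[base + i];
		r.h[2 * i + 1] = b.h[base + i];
	}
	return r;
}

/* zip1 with the .4s arrangement */
static vreg zip1_4s(vreg a, vreg b)
{
	vreg r;

	r.s[0] = a.s[0];
	r.s[1] = b.s[0];
	r.s[2] = a.s[1];
	r.s[3] = b.s[1];
	return r;
}

int main(void)
{
	vreg v6 = { { 0 } }, v7 = { { 0 } }, v8 = { { 0 } };

	/* movi v7.4h, #1 / movi v8.4h, #2: write the low 64 bits, zero the rest */
	for (int i = 0; i < 4; i++) {
		v7.h[i] = 1;
		v8.h[i] = 2;
	}

	/* uaddl v6.4s, v7.4h, v8.4h: widen the low halfwords and add -> { 3, 3, 3, 3 } */
	for (int i = 0; i < 4; i++)
		v6.s[i] = (uint32_t)v7.h[i] + v8.h[i];

	v8 = zip_8h(v7, v8, 0);	/* zip1 v8.8h: halfwords 1,2,1,2,1,2,1,2 */
	v8 = zip1_4s(v8, v6);	/* zip1 v8.4s: words 0x20001, 3, 0x20001, 3 */
	v8 = zip_8h(v8, v7, 1);	/* zip2 v8.8h: words 1, 2, 3, 0 */

	printf("{ %u, %u, %u, %u }\n", v8.s[0], v8.s[1], v8.s[2], v8.s[3]);
	return 0;
}

If I've modelled the instructions correctly, this prints { 1, 2, 3, 0 },
matching the "compose addend vector" comment and the old
=0x30000000200000001 literal, with the final zip2 pulling the 0 lanes
from the zeroed top half of v7.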