On 21 August 2018 at 20:04, Nick Desaulniers <ndesaulniers@xxxxxxxxxx> wrote:
> On Tue, Aug 21, 2018 at 9:46 AM Ard Biesheuvel
> <ard.biesheuvel@xxxxxxxxxx> wrote:
>>
>> Replace the literal load of the addend vector with a sequence that
>> composes it using immediates. While at it, tweak the code that refers
>> to it so it does not clobber the register, so we can take the load
>> out of the loop as well.
>>
>> This results in generally better code, but also works around an issue
>> in Clang, whose integrated assembler does not implement the GNU ARM
>> asm syntax completely and does not support the =literal notation for
>> FP registers.
>
> Would you mind linking to the issue tracker for:
> https://bugs.llvm.org/show_bug.cgi?id=38642
>
> And maybe the comment from the binutils source? (or the arm32 reference
> manual you mentioned in https://lkml.org/lkml/2018/8/21/589)
> https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gas/testsuite/gas/aarch64/programmer-friendly.s;h=6254c6476efdc848648b05068be0574e7addc85d;hb=HEAD#l11
>
> They can help provide more context to future travelers.
>

Sure, if it helps.

>>
>> Cc: Nick Desaulniers <ndesaulniers@xxxxxxxxxx>
>> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx>
>> ---
>>  arch/arm64/crypto/aes-modes.S | 18 ++++++++++++------
>>  1 file changed, 12 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
>> index 483a7130cf0e..e966620ee230 100644
>> --- a/arch/arm64/crypto/aes-modes.S
>> +++ b/arch/arm64/crypto/aes-modes.S
>> @@ -225,6 +225,14 @@ AES_ENTRY(aes_ctr_encrypt)
>>                 enc_prepare     w22, x21, x6
>>                 ld1             {v4.16b}, [x24]
>>
>> +               /* compose addend vector { 1, 2, 3, 0 } in v8.4s */
>> +               movi            v7.4h, #1
>> +               movi            v8.4h, #2
>> +               uaddl           v6.4s, v7.4h, v8.4h
>
> Clever; how come you didn't `movi v6.4h, #3` or `saddl`? Shorter
> encoding? Or does it simplify the zips later?

Instruction encodings are always 32 bits on AArch64, so that does not
make a difference.

> Since the destination
> is larger, does this give us the 0?
>

Yes.

> The following zip1/zip2 block is a little tricky. Can you help me
> understand it better?
>

Please see below.

>> +               zip1            v8.8h, v7.8h, v8.8h
>
> If zip1 takes the upper half, wouldn't that be zeros, since we moved
> small constants into them, above?
>
>> +               zip1            v8.4s, v8.4s, v6.4s
>> +               zip2            v8.8h, v8.8h, v7.8h
>
> From the docs [0], it looks like zip1/zip2 are used for interleaving
> two vectors, IIUC? In our case we're interleaving three vectors; v6,
> v7, and v8 into v8?
>

No, the first register is only the destination register in this case,
not a source register.

> And we don't need a second zip2 because...? Don't we need (or are
> more interested in) the bottom half of v6?
>

To clarify, these are the consecutive values taken by the registers,
shown as 16-bit elements:

v7 := { 1, 1, 1, 1, 0, 0, 0, 0 }    (after movi  v7.4h, #1)
v8 := { 2, 2, 2, 2, 0, 0, 0, 0 }    (after movi  v8.4h, #2)
v6 := { 3, 0, 3, 0, 3, 0, 3, 0 }    (after uaddl v6.4s, v7.4h, v8.4h)
v8 := { 1, 2, 1, 2, 1, 2, 1, 2 }    (after zip1  v8.8h, v7.8h, v8.8h)
v8 := { 1, 2, 3, 0, 1, 2, 3, 0 }    (after zip1  v8.4s, v8.4s, v6.4s)
v8 := { 1, 0, 2, 0, 3, 0, 0, 0 }    (after zip2  v8.8h, v8.8h, v7.8h)

Read as 32-bit words, that final value of v8 is { 1, 2, 3, 0 }, which
is exactly the addend vector the loop needs.

> Note to self: Page 6 [1] seems like a useful doc on arrangement specifiers.
>
> To get { 1, 2, 3, 0 }, does ARM have something like iota/lane
> id+swizzle instructions, i.e.:
>
> iota -> { 0, 1, 2, 3 }
> swizzle -> { 1, 2, 3, 0 }
>

AArch64 has the ext instruction, which concatenates n trailing bytes
of one source register with 16-n leading bytes of another. This can be
used, e.g., to implement a byte-sized rotate when using the same
register for both inputs.
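
If it helps to see the whole sequence in action, the following is a rough
userspace translation of it using the arm_neon.h intrinsics. It is an
illustration only, not the kernel code: the variable names simply mirror
the registers in the patch, and on an AArch64 machine it should print the
composed addend vector followed by a copy rotated down by one word via ext.

/* addend-demo.c: illustration only, not the kernel implementation */
#include <arm_neon.h>
#include <stdio.h>

int main(void)
{
        /* movi v7.4h, #1 / movi v8.4h, #2: writing the 64-bit .4h form
         * also zeroes the upper 64 bits, modelled here with vcombine */
        uint16x8_t v7 = vcombine_u16(vdup_n_u16(1), vdup_n_u16(0));
        uint16x8_t v8 = vcombine_u16(vdup_n_u16(2), vdup_n_u16(0));

        /* uaddl v6.4s, v7.4h, v8.4h: widening add -> 32-bit { 3, 3, 3, 3 } */
        uint32x4_t v6 = vaddl_u16(vget_low_u16(v7), vget_low_u16(v8));

        /* zip1 v8.8h, v7.8h, v8.8h -> halfwords { 1, 2, 1, 2, 1, 2, 1, 2 } */
        v8 = vzip1q_u16(v7, v8);

        /* zip1 v8.4s, v8.4s, v6.4s -> halfwords { 1, 2, 3, 0, 1, 2, 3, 0 } */
        v8 = vreinterpretq_u16_u32(vzip1q_u32(vreinterpretq_u32_u16(v8), v6));

        /* zip2 v8.8h, v8.8h, v7.8h -> halfwords { 1, 0, 2, 0, 3, 0, 0, 0 },
         * i.e. words { 1, 2, 3, 0 } */
        v8 = vzip2q_u16(v8, v7);

        uint32x4_t addends = vreinterpretq_u32_u16(v8);
        printf("{ %u, %u, %u, %u }\n",
               vgetq_lane_u32(addends, 0), vgetq_lane_u32(addends, 1),
               vgetq_lane_u32(addends, 2), vgetq_lane_u32(addends, 3));

        /* ext as a rotate: concatenate the trailing 12 bytes of the vector
         * with its own leading 4 bytes, rotating the words down by one */
        uint32x4_t rot = vreinterpretq_u32_u8(
                        vextq_u8(vreinterpretq_u8_u32(addends),
                                 vreinterpretq_u8_u32(addends), 4));
        printf("{ %u, %u, %u, %u }\n",
               vgetq_lane_u32(rot, 0), vgetq_lane_u32(rot, 1),
               vgetq_lane_u32(rot, 2), vgetq_lane_u32(rot, 3));

        return 0;
}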
> [0] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802a/UADDL_advsimd_vector.html
> [1] https://www.uio.no/studier/emner/matnat/ifi/INF5063/h17/timeplan/armv8-neon.pdf
>
>> +
>>                 umov            x6, v4.d[1]     /* keep swabbed ctr in reg */
>>                 rev             x6, x6
>> .LctrloopNx:
>> @@ -232,17 +240,16 @@
>>                 bmi             .Lctr1x
>>                 cmn             w6, #4          /* 32 bit overflow? */
>>                 bcs             .Lctr1x
>> -               ldr             q8, =0x30000000200000001        /* addends 1,2,3[,0] */
>>                 dup             v7.4s, w6
>>                 mov             v0.16b, v4.16b
>>                 add             v7.4s, v7.4s, v8.4s
>>                 mov             v1.16b, v4.16b
>> -               rev32           v8.16b, v7.16b
>> +               rev32           v7.16b, v7.16b
>>                 mov             v2.16b, v4.16b
>>                 mov             v3.16b, v4.16b
>> -               mov             v1.s[3], v8.s[0]
>> -               mov             v2.s[3], v8.s[1]
>> -               mov             v3.s[3], v8.s[2]
>> +               mov             v1.s[3], v7.s[0]
>> +               mov             v2.s[3], v7.s[1]
>> +               mov             v3.s[3], v7.s[2]
>>                 ld1             {v5.16b-v7.16b}, [x20], #48     /* get 3 input blocks */
>>                 bl              aes_encrypt_block4x
>>                 eor             v0.16b, v5.16b, v0.16b
>> @@ -296,7 +303,6 @@
>>                 ins             v4.d[0], x7
>>                 b               .Lctrcarrydone
>> AES_ENDPROC(aes_ctr_encrypt)
>> -       .ltorg
>>
>>
>>         /*
>> --
>> 2.17.1
>>
>
>
> --
> Thanks,
> ~Nick Desaulniers
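
For completeness, here is a similarly hedged C model of how the patched
.LctrloopNx body consumes the hoisted addend vector. The helper name,
parameter names, and types are invented for illustration; the
authoritative version is the assembly quoted above.

#include <arm_neon.h>
#include <stdint.h>

/*
 * Illustration only: what the patched loop body does per iteration.
 *   v4      - current counter block as loaded from memory (big endian)
 *   w6      - low 32 bits of the counter, already byte-swapped to CPU order
 *   addends - the precomputed { 1, 2, 3, 0 } vector (v8 in the asm)
 *   blk[]   - the four counter blocks handed to aes_encrypt_block4x
 */
static void ctr_next_blocks(uint8x16_t v4, uint32_t w6,
                            uint32x4_t addends, uint8x16_t blk[4])
{
        /* dup v7.4s, w6 ; add v7.4s, v7.4s, v8.4s */
        uint32x4_t v7 = vaddq_u32(vdupq_n_u32(w6), addends);

        /* rev32 v7.16b, v7.16b: swab each word back to big endian; the
         * result now lands in v7 rather than v8, so the addend vector is
         * not clobbered and its setup can stay outside the loop */
        v7 = vreinterpretq_u32_u8(vrev32q_u8(vreinterpretq_u8_u32(v7)));

        /* mov v0-v3, v4 ; then mov vN.s[3], v7.s[N-1] for N = 1..3 */
        blk[0] = v4;
        blk[1] = vreinterpretq_u8_u32(vsetq_lane_u32(
                        vgetq_lane_u32(v7, 0), vreinterpretq_u32_u8(v4), 3));
        blk[2] = vreinterpretq_u8_u32(vsetq_lane_u32(
                        vgetq_lane_u32(v7, 1), vreinterpretq_u32_u8(v4), 3));
        blk[3] = vreinterpretq_u8_u32(vsetq_lane_u32(
                        vgetq_lane_u32(v7, 2), vreinterpretq_u32_u8(v4), 3));
}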