On Tue, Aug 21, 2018 at 9:46 AM Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> wrote:
>
> Replace the literal load of the addend vector with a sequence that
> composes it using immediates. While at it, tweak the code that refers
> to it so it does not clobber the register, so we can take the load
> out of the loop as well.
>
> This results in generally better code, but also works around a Clang
> issue, whose integrated assembler does not implement the GNU ARM asm
> syntax completely, and does not support the =literal notation for
> FP registers.

Would you mind linking to the issue tracker for this:
https://bugs.llvm.org/show_bug.cgi?id=38642

And maybe the comment from the binutils source? (or the arm32 reference
manual you mentioned in https://lkml.org/lkml/2018/8/21/589)
https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gas/testsuite/gas/aarch64/programmer-friendly.s;h=6254c6476efdc848648b05068be0574e7addc85d;hb=HEAD#l11

They can help provide more context for future travelers.

>
> Cc: Nick Desaulniers <ndesaulniers@xxxxxxxxxx>
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx>
> ---
>  arch/arm64/crypto/aes-modes.S | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm64/crypto/aes-modes.S b/arch/arm64/crypto/aes-modes.S
> index 483a7130cf0e..e966620ee230 100644
> --- a/arch/arm64/crypto/aes-modes.S
> +++ b/arch/arm64/crypto/aes-modes.S
> @@ -225,6 +225,14 @@ AES_ENTRY(aes_ctr_encrypt)
>         enc_prepare     w22, x21, x6
>         ld1             {v4.16b}, [x24]
>
> +       /* compose addend vector { 1, 2, 3, 0 } in v8.4s */
> +       movi            v7.4h, #1
> +       movi            v8.4h, #2
> +       uaddl           v6.4s, v7.4h, v8.4h

Clever; how come you didn't `movi v6.4h, #3` or use `saddl`? Shorter
encoding? Or does it simplify the zips later? Since the destination is
larger, does this give us the 0?

The following zip1/zip2 block is a little tricky. Can you help me
understand it better?

> +       zip1            v8.8h, v7.8h, v8.8h

If zip1 takes the upper halves, wouldn't those be zeros, since the movi
instructions above only wrote small constants to the lower 64 bits?

> +       zip1            v8.4s, v8.4s, v6.4s
> +       zip2            v8.8h, v8.8h, v7.8h

From the docs [0], it looks like zip1/zip2 are used for interleaving two
vectors, IIUC? In our case we're interleaving three vectors (v6, v7, and
v8) into v8? And we don't need a second zip2 because...? Don't we need
(or aren't we more interested in) the bottom half of v6? I've tried to
trace the lane values myself; see the sketch at the bottom of this mail.

Note to self: Page 6 [1] seems like a useful doc on arrangement
specifiers.

To get { 1, 2, 3, 0 }, does ARM have something like iota/lane-id +
swizzle instructions, i.e.:
iota -> { 0, 1, 2, 3 }
swizzle -> { 1, 2, 3, 0 }

[0] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802a/UADDL_advsimd_vector.html
[1] https://www.uio.no/studier/emner/matnat/ifi/INF5063/h17/timeplan/armv8-neon.pdf

> +
>         umov            x6, v4.d[1]             /* keep swabbed ctr in reg */
>         rev             x6, x6
> .LctrloopNx:
> @@ -232,17 +240,16 @@ AES_ENTRY(aes_ctr_encrypt)
>         bmi             .Lctr1x
>         cmn             w6, #4                  /* 32 bit overflow? */
>         bcs             .Lctr1x
> -       ldr             q8, =0x30000000200000001        /* addends 1,2,3[,0] */
>         dup             v7.4s, w6
>         mov             v0.16b, v4.16b
>         add             v7.4s, v7.4s, v8.4s
>         mov             v1.16b, v4.16b
> -       rev32           v8.16b, v7.16b
> +       rev32           v7.16b, v7.16b
>         mov             v2.16b, v4.16b
>         mov             v3.16b, v4.16b
> -       mov             v1.s[3], v8.s[0]
> -       mov             v2.s[3], v8.s[1]
> -       mov             v3.s[3], v8.s[2]
> +       mov             v1.s[3], v7.s[0]
> +       mov             v2.s[3], v7.s[1]
> +       mov             v3.s[3], v7.s[2]
>         ld1             {v5.16b-v7.16b}, [x20], #48     /* get 3 input blocks */
>         bl              aes_encrypt_block4x
>         eor             v0.16b, v5.16b, v0.16b
> @@ -296,7 +303,6 @@ AES_ENTRY(aes_ctr_encrypt)
>         ins             v4.d[0], x7
>         b               .Lctrcarrydone
> AES_ENDPROC(aes_ctr_encrypt)
> -       .ltorg
>
>
> /*
> --
> 2.17.1
>

-- 
Thanks,
~Nick Desaulniers
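
P.S. To check my own understanding, I hacked up a small standalone C
sketch that replays the lane shuffles on plain arrays. This is just my
reading of the movi/uaddl/zip1/zip2 semantics, assuming the
little-endian lane layout; the vreg union and the helper names are made
up for illustration, so please correct me if I've got the semantics
wrong:

/*
 * Standalone userspace sketch (not kernel code): model v6/v7/v8 as
 * arrays and replay the instruction sequence to see what ends up in
 * v8.4s.  Little-endian lane layout assumed.
 */
#include <stdint.h>
#include <stdio.h>

typedef union {
	uint16_t h[8];	/* .8h view of a 128-bit register */
	uint32_t s[4];	/* .4s view */
} vreg;

/* zip1 (hi == 0) / zip2 (hi == 1) with the .8h arrangement */
static vreg zip_8h(vreg a, vreg b, int hi)
{
	int base = hi ? 4 : 0;
	vreg r;

	for (int i = 0; i < 4; i++) {
		r.h[2 * i]     = a.h[base + i];
		r.h[2 * i + 1] = b.h[base + i];
	}
	return r;
}

/* zip1 with the .4s arrangement */
static vreg zip1_4s(vreg a, vreg b)
{
	vreg r;

	r.s[0] = a.s[0];
	r.s[1] = b.s[0];
	r.s[2] = a.s[1];
	r.s[3] = b.s[1];
	return r;
}

int main(void)
{
	vreg v6 = { { 0 } }, v7 = { { 0 } }, v8 = { { 0 } };

	/* movi v7.4h, #1 / movi v8.4h, #2: write the low 64 bits, zero the rest */
	for (int i = 0; i < 4; i++) {
		v7.h[i] = 1;
		v8.h[i] = 2;
	}

	/* uaddl v6.4s, v7.4h, v8.4h: widen the low halfwords and add -> { 3, 3, 3, 3 } */
	for (int i = 0; i < 4; i++)
		v6.s[i] = (uint32_t)v7.h[i] + v8.h[i];

	v8 = zip_8h(v7, v8, 0);	/* zip1 v8.8h: halfwords 1,2,1,2,1,2,1,2 */
	v8 = zip1_4s(v8, v6);	/* zip1 v8.4s: words 0x20001, 3, 0x20001, 3 */
	v8 = zip_8h(v8, v7, 1);	/* zip2 v8.8h: words 1, 2, 3, 0 */

	printf("{ %u, %u, %u, %u }\n", v8.s[0], v8.s[1], v8.s[2], v8.s[3]);
	return 0;
}

If I've modelled the instructions correctly, this prints { 1, 2, 3, 0 },
matching the "compose addend vector" comment and the old
=0x30000000200000001 literal, with the final zip2 pulling the 0 lanes
from the zeroed top half of v7.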