Re: [PATCH v3 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS

Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> · Sun, 17 Jun 2018 11:30:27 +0200

On 17 June 2018 at 00:40, Stefan Agner <stefan@xxxxxxxx> wrote:
> Hi Eric,
>
> On 14.02.2018 19:42, Eric Biggers wrote:
>> Add an ARM NEON-accelerated implementation of Speck-XTS.  It operates on
>> 128-byte chunks at a time, i.e. 8 blocks for Speck128 or 16 blocks for
>> Speck64.  Each 128-byte chunk goes through XTS preprocessing, then is
>> encrypted/decrypted (doing one cipher round for all the blocks, then the
>> next round, etc.), then goes through XTS postprocessing.
>>
>> The performance depends on the processor but can be about 3 times faster
>> than the generic code.  For example, on an ARMv7 processor we observe
>> the following performance with Speck128/256-XTS:
>>
>>     xts-speck128-neon:     Encryption 107.9 MB/s, Decryption 108.1 MB/s
>>     xts(speck128-generic): Encryption  32.1 MB/s, Decryption  36.6 MB/s
>>
>> In comparison to AES-256-XTS without the Cryptography Extensions:
>>
>>     xts-aes-neonbs:        Encryption  41.2 MB/s, Decryption  36.7 MB/s
>>     xts(aes-asm):          Encryption  31.7 MB/s, Decryption  30.8 MB/s
>>     xts(aes-generic):      Encryption  21.2 MB/s, Decryption  20.9 MB/s
>>
>> Speck64/128-XTS is even faster:
>>
>>     xts-speck64-neon:      Encryption 138.6 MB/s, Decryption 139.1 MB/s
>>
>> Note that as with the generic code, only the Speck128 and Speck64
>> variants are supported.  Also, for now only the XTS mode of operation is
>> supported, to target the disk and file encryption use cases.  The NEON
>> code also only handles the portion of the data that is evenly divisible
>> into 128-byte chunks, with any remainder handled by a C fallback.  Of
>> course, other modes of operation could be added later if needed, and/or
>> the NEON code could be updated to handle other buffer sizes.
>>
>> The XTS specification is only defined for AES which has a 128-bit block
>> size, so for the GF(2^64) math needed for Speck64-XTS we use the
>> reducing polynomial 'x^64 + x^4 + x^3 + x + 1' given by the original XEX
>> paper.  Of course, when possible users should use Speck128-XTS, but even
>> that may be too slow on some processors; Speck64-XTS can be faster.
>>
>> Signed-off-by: Eric Biggers <ebiggers@xxxxxxxxxx>
>> ---
>>  arch/arm/crypto/Kconfig           |   6 +
>>  arch/arm/crypto/Makefile          |   2 +
>>  arch/arm/crypto/speck-neon-core.S | 432 ++++++++++++++++++++++++++++++
>>  arch/arm/crypto/speck-neon-glue.c | 288 ++++++++++++++++++++
>>  4 files changed, 728 insertions(+)
>>  create mode 100644 arch/arm/crypto/speck-neon-core.S
>>  create mode 100644 arch/arm/crypto/speck-neon-glue.c
>>
>> diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
>> index b8e69fe282b8..925d1364727a 100644
>> --- a/arch/arm/crypto/Kconfig
>> +++ b/arch/arm/crypto/Kconfig
>> @@ -121,4 +121,10 @@ config CRYPTO_CHACHA20_NEON
>>       select CRYPTO_BLKCIPHER
>>       select CRYPTO_CHACHA20
>>
>> +config CRYPTO_SPECK_NEON
>> +     tristate "NEON accelerated Speck cipher algorithms"
>> +     depends on KERNEL_MODE_NEON
>> +     select CRYPTO_BLKCIPHER
>> +     select CRYPTO_SPECK
>> +
>>  endif
>> diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
>> index 30ef8e291271..a758107c5525 100644
>> --- a/arch/arm/crypto/Makefile
>> +++ b/arch/arm/crypto/Makefile
>> @@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
>>  obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
>>  obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
>>  obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
>> +obj-$(CONFIG_CRYPTO_SPECK_NEON) += speck-neon.o
>>
>>  ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
>>  ce-obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) += sha1-arm-ce.o
>> @@ -53,6 +54,7 @@ ghash-arm-ce-y      := ghash-ce-core.o ghash-ce-glue.o
>>  crct10dif-arm-ce-y   := crct10dif-ce-core.o crct10dif-ce-glue.o
>>  crc32-arm-ce-y:= crc32-ce-core.o crc32-ce-glue.o
>>  chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
>> +speck-neon-y := speck-neon-core.o speck-neon-glue.o
>>
>>  quiet_cmd_perl = PERL    $@
>>        cmd_perl = $(PERL) $(<) > $(@)
>> diff --git a/arch/arm/crypto/speck-neon-core.S
>> b/arch/arm/crypto/speck-neon-core.S
>> new file mode 100644
>> index 000000000000..3c1e203e53b9
>> --- /dev/null
>> +++ b/arch/arm/crypto/speck-neon-core.S
>> @@ -0,0 +1,432 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * NEON-accelerated implementation of Speck128-XTS and Speck64-XTS
>> + *
>> + * Copyright (c) 2018 Google, Inc
>> + *
>> + * Author: Eric Biggers <ebiggers@xxxxxxxxxx>
>> + */
>> +
>> +#include <linux/linkage.h>
>> +
>> +     .text
>> +     .fpu            neon
>> +
>> +     // arguments
>> +     ROUND_KEYS      .req    r0      // const {u64,u32} *round_keys
>> +     NROUNDS         .req    r1      // int nrounds
>> +     DST             .req    r2      // void *dst
>> +     SRC             .req    r3      // const void *src
>> +     NBYTES          .req    r4      // unsigned int nbytes
>> +     TWEAK           .req    r5      // void *tweak
>> +
>> +     // registers which hold the data being encrypted/decrypted
>> +     X0              .req    q0
>> +     X0_L            .req    d0
>> +     X0_H            .req    d1
>> +     Y0              .req    q1
>> +     Y0_H            .req    d3
>> +     X1              .req    q2
>> +     X1_L            .req    d4
>> +     X1_H            .req    d5
>> +     Y1              .req    q3
>> +     Y1_H            .req    d7
>> +     X2              .req    q4
>> +     X2_L            .req    d8
>> +     X2_H            .req    d9
>> +     Y2              .req    q5
>> +     Y2_H            .req    d11
>> +     X3              .req    q6
>> +     X3_L            .req    d12
>> +     X3_H            .req    d13
>> +     Y3              .req    q7
>> +     Y3_H            .req    d15
>> +
>> +     // the round key, duplicated in all lanes
>> +     ROUND_KEY       .req    q8
>> +     ROUND_KEY_L     .req    d16
>> +     ROUND_KEY_H     .req    d17
>> +
>> +     // index vector for vtbl-based 8-bit rotates
>> +     ROTATE_TABLE    .req    d18
>> +
>> +     // multiplication table for updating XTS tweaks
>> +     GF128MUL_TABLE  .req    d19
>> +     GF64MUL_TABLE   .req    d19
>> +
>> +     // current XTS tweak value(s)
>> +     TWEAKV          .req    q10
>> +     TWEAKV_L        .req    d20
>> +     TWEAKV_H        .req    d21
>> +
>> +     TMP0            .req    q12
>> +     TMP0_L          .req    d24
>> +     TMP0_H          .req    d25
>> +     TMP1            .req    q13
>> +     TMP2            .req    q14
>> +     TMP3            .req    q15
>> +
>> +     .align          4
>> +.Lror64_8_table:
>> +     .byte           1, 2, 3, 4, 5, 6, 7, 0
>> +.Lror32_8_table:
>> +     .byte           1, 2, 3, 0, 5, 6, 7, 4
>> +.Lrol64_8_table:
>> +     .byte           7, 0, 1, 2, 3, 4, 5, 6
>> +.Lrol32_8_table:
>> +     .byte           3, 0, 1, 2, 7, 4, 5, 6
>> +.Lgf128mul_table:
>> +     .byte           0, 0x87
>> +     .fill           14
>> +.Lgf64mul_table:
>> +     .byte           0, 0x1b, (0x1b << 1), (0x1b << 1) ^ 0x1b
>> +     .fill           12
>> +
>> +/*
>> + * _speck_round_128bytes() - Speck encryption round on 128 bytes at a time
>> + *
>> + * Do one Speck encryption round on the 128 bytes (8 blocks for
>> Speck128, 16 for
>> + * Speck64) stored in X0-X3 and Y0-Y3, using the round key stored in all lanes
>> + * of ROUND_KEY.  'n' is the lane size: 64 for Speck128, or 32 for Speck64.
>> + *
>> + * The 8-bit rotates are implemented using vtbl instead of vshr + vsli because
>> + * the vtbl approach is faster on some processors and the same speed on others.
>> + */
>> +.macro _speck_round_128bytes n
>> +
>> +     // x = ror(x, 8)
>> +     vtbl.8          X0_L, {X0_L}, ROTATE_TABLE
>> +     vtbl.8          X0_H, {X0_H}, ROTATE_TABLE
>> +     vtbl.8          X1_L, {X1_L}, ROTATE_TABLE
>> +     vtbl.8          X1_H, {X1_H}, ROTATE_TABLE
>> +     vtbl.8          X2_L, {X2_L}, ROTATE_TABLE
>> +     vtbl.8          X2_H, {X2_H}, ROTATE_TABLE
>> +     vtbl.8          X3_L, {X3_L}, ROTATE_TABLE
>> +     vtbl.8          X3_H, {X3_H}, ROTATE_TABLE
>> +
>> +     // x += y
>> +     vadd.u\n        X0, Y0
>> +     vadd.u\n        X1, Y1
>> +     vadd.u\n        X2, Y2
>> +     vadd.u\n        X3, Y3
>> +
>> +     // x ^= k
>> +     veor            X0, ROUND_KEY
>> +     veor            X1, ROUND_KEY
>> +     veor            X2, ROUND_KEY
>> +     veor            X3, ROUND_KEY
>> +
>> +     // y = rol(y, 3)
>> +     vshl.u\n        TMP0, Y0, #3
>> +     vshl.u\n        TMP1, Y1, #3
>> +     vshl.u\n        TMP2, Y2, #3
>> +     vshl.u\n        TMP3, Y3, #3
>> +     vsri.u\n        TMP0, Y0, #(\n - 3)
>> +     vsri.u\n        TMP1, Y1, #(\n - 3)
>> +     vsri.u\n        TMP2, Y2, #(\n - 3)
>> +     vsri.u\n        TMP3, Y3, #(\n - 3)
>> +
>> +     // y ^= x
>> +     veor            Y0, TMP0, X0
>> +     veor            Y1, TMP1, X1
>> +     veor            Y2, TMP2, X2
>> +     veor            Y3, TMP3, X3
>> +.endm
>> +
>> +/*
>> + * _speck_unround_128bytes() - Speck decryption round on 128 bytes at a time
>> + *
>> + * This is the inverse of _speck_round_128bytes().
>> + */
>> +.macro _speck_unround_128bytes       n
>> +
>> +     // y ^= x
>> +     veor            TMP0, Y0, X0
>> +     veor            TMP1, Y1, X1
>> +     veor            TMP2, Y2, X2
>> +     veor            TMP3, Y3, X3
>> +
>> +     // y = ror(y, 3)
>> +     vshr.u\n        Y0, TMP0, #3
>> +     vshr.u\n        Y1, TMP1, #3
>> +     vshr.u\n        Y2, TMP2, #3
>> +     vshr.u\n        Y3, TMP3, #3
>> +     vsli.u\n        Y0, TMP0, #(\n - 3)
>> +     vsli.u\n        Y1, TMP1, #(\n - 3)
>> +     vsli.u\n        Y2, TMP2, #(\n - 3)
>> +     vsli.u\n        Y3, TMP3, #(\n - 3)
>> +
>> +     // x ^= k
>> +     veor            X0, ROUND_KEY
>> +     veor            X1, ROUND_KEY
>> +     veor            X2, ROUND_KEY
>> +     veor            X3, ROUND_KEY
>> +
>> +     // x -= y
>> +     vsub.u\n        X0, Y0
>> +     vsub.u\n        X1, Y1
>> +     vsub.u\n        X2, Y2
>> +     vsub.u\n        X3, Y3
>> +
>> +     // x = rol(x, 8);
>> +     vtbl.8          X0_L, {X0_L}, ROTATE_TABLE
>> +     vtbl.8          X0_H, {X0_H}, ROTATE_TABLE
>> +     vtbl.8          X1_L, {X1_L}, ROTATE_TABLE
>> +     vtbl.8          X1_H, {X1_H}, ROTATE_TABLE
>> +     vtbl.8          X2_L, {X2_L}, ROTATE_TABLE
>> +     vtbl.8          X2_H, {X2_H}, ROTATE_TABLE
>> +     vtbl.8          X3_L, {X3_L}, ROTATE_TABLE
>> +     vtbl.8          X3_H, {X3_H}, ROTATE_TABLE
>> +.endm
>> +
>> +.macro _xts128_precrypt_one  dst_reg, tweak_buf, tmp
>> +
>> +     // Load the next source block
>> +     vld1.8          {\dst_reg}, [SRC]!
>> +
>> +     // Save the current tweak in the tweak buffer
>> +     vst1.8          {TWEAKV}, [\tweak_buf:128]!
>> +
>> +     // XOR the next source block with the current tweak
>> +     veor            \dst_reg, TWEAKV
>> +
>> +     /*
>> +      * Calculate the next tweak by multiplying the current one by x,
>> +      * modulo p(x) = x^128 + x^7 + x^2 + x + 1.
>> +      */
>> +     vshr.u64        \tmp, TWEAKV, #63
>> +     vshl.u64        TWEAKV, #1
>> +     veor            TWEAKV_H, \tmp\()_L
>> +     vtbl.8          \tmp\()_H, {GF128MUL_TABLE}, \tmp\()_H
>> +     veor            TWEAKV_L, \tmp\()_H
>> +.endm
>> +
>> +.macro _xts64_precrypt_two   dst_reg, tweak_buf, tmp
>> +
>> +     // Load the next two source blocks
>> +     vld1.8          {\dst_reg}, [SRC]!
>> +
>> +     // Save the current two tweaks in the tweak buffer
>> +     vst1.8          {TWEAKV}, [\tweak_buf:128]!
>> +
>> +     // XOR the next two source blocks with the current two tweaks
>> +     veor            \dst_reg, TWEAKV
>> +
>> +     /*
>> +      * Calculate the next two tweaks by multiplying the current ones by x^2,
>> +      * modulo p(x) = x^64 + x^4 + x^3 + x + 1.
>> +      */
>> +     vshr.u64        \tmp, TWEAKV, #62
>> +     vshl.u64        TWEAKV, #2
>> +     vtbl.8          \tmp\()_L, {GF64MUL_TABLE}, \tmp\()_L
>> +     vtbl.8          \tmp\()_H, {GF64MUL_TABLE}, \tmp\()_H
>> +     veor            TWEAKV, \tmp
>> +.endm
>> +
>> +/*
>> + * _speck_xts_crypt() - Speck-XTS encryption/decryption
>> + *
>> + * Encrypt or decrypt NBYTES bytes of data from the SRC buffer to the
>> DST buffer
>> + * using Speck-XTS, specifically the variant with a block size of
>> '2n' and round
>> + * count given by NROUNDS.  The expanded round keys are given in
>> ROUND_KEYS, and
>> + * the current XTS tweak value is given in TWEAK.  It's assumed that
>> NBYTES is a
>> + * nonzero multiple of 128.
>> + */
>> +.macro _speck_xts_crypt      n, decrypting
>> +     push            {r4-r7}
>> +     mov             r7, sp
>> +
>> +     /*
>> +      * The first four parameters were passed in registers r0-r3.  Load the
>> +      * additional parameters, which were passed on the stack.
>> +      */
>> +     ldr             NBYTES, [sp, #16]
>> +     ldr             TWEAK, [sp, #20]
>> +
>> +     /*
>> +      * If decrypting, modify the ROUND_KEYS parameter to point to the last
>> +      * round key rather than the first, since for decryption the round keys
>> +      * are used in reverse order.
>> +      */
>> +.if \decrypting
>> +.if \n == 64
>> +     add             ROUND_KEYS, ROUND_KEYS, NROUNDS, lsl #3
>> +     sub             ROUND_KEYS, #8
>> +.else
>> +     add             ROUND_KEYS, ROUND_KEYS, NROUNDS, lsl #2
>> +     sub             ROUND_KEYS, #4
>> +.endif
>> +.endif
>> +
>> +     // Load the index vector for vtbl-based 8-bit rotates
>> +.if \decrypting
>> +     ldr             r12, =.Lrol\n\()_8_table
>> +.else
>> +     ldr             r12, =.Lror\n\()_8_table
>> +.endif
>> +     vld1.8          {ROTATE_TABLE}, [r12:64]
>> +
>> +     // One-time XTS preparation
>> +
>> +     /*
>> +      * Allocate stack space to store 128 bytes worth of tweaks.  For
>> +      * performance, this space is aligned to a 16-byte boundary so that we
>> +      * can use the load/store instructions that declare 16-byte alignment.
>> +      */
>> +     sub             sp, #128
>> +     bic             sp, #0xf
>
>
> This fails here when building with CONFIG_THUMB2_KERNEL=y
>
>   AS      arch/arm/crypto/speck-neon-core.o
>
> arch/arm/crypto/speck-neon-core.S: Assembler messages:
>
> arch/arm/crypto/speck-neon-core.S:419: Error: r13 not allowed here --
> `bic sp,#0xf'
> arch/arm/crypto/speck-neon-core.S:423: Error: r13 not allowed here --
> `bic sp,#0xf'
> arch/arm/crypto/speck-neon-core.S:427: Error: r13 not allowed here --
> `bic sp,#0xf'
> arch/arm/crypto/speck-neon-core.S:431: Error: r13 not allowed here --
> `bic sp,#0xf'
>
> In a quick hack this change seems to address it:
>
>
> -       sub             sp, #128
> -       bic             sp, #0xf
> +       mov             r6, sp
> +       sub             r6, #128
> +       bic             r6, #0xf
> +       mov             sp, r6
>
> But there is probably a better solution to address this.
>

Given that there is no NEON on M class cores, I recommend we put something like

THUMB(bx pc)
THUMB(nop.w)
THUMB(.arm)

at the beginning and be done with it.