Re: [PATCH v2 3/5] crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS

Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> · Tue, 13 Feb 2018 19:04:30 +0000

On 13 February 2018 at 18:57, Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
> Hi Ard,
>
> On Tue, Feb 13, 2018 at 11:34:36AM +0000, Ard Biesheuvel wrote:
>> Hi Eric,
>>
>> On 12 February 2018 at 23:52, Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
>> > Add an ARM NEON-accelerated implementation of Speck-XTS.  It operates on
>> > 128-byte chunks at a time, i.e. 8 blocks for Speck128 or 16 blocks for
>> > Speck64.  Each 128-byte chunk goes through XTS preprocessing, then is
>> > encrypted/decrypted (doing one cipher round for all the blocks, then the
>> > next round, etc.), then goes through XTS postprocessing.
>> >
>> > The performance depends on the processor but can be about 3 times faster
>> > than the generic code.  For example, on an ARMv7 processor we observe
>> > the following performance with Speck128/256-XTS:
>> >
>> >     xts-speck128-neon:     Encryption 107.9 MB/s, Decryption 108.1 MB/s
>> >     xts(speck128-generic): Encryption  32.1 MB/s, Decryption  36.6 MB/s
>> >
>> > In comparison to AES-256-XTS without the Cryptography Extensions:
>> >
>> >     xts-aes-neonbs:        Encryption  41.2 MB/s, Decryption  36.7 MB/s
>> >     xts(aes-asm):          Encryption  31.7 MB/s, Decryption  30.8 MB/s
>> >     xts(aes-generic):      Encryption  21.2 MB/s, Decryption  20.9 MB/s
>> >
>> > Speck64/128-XTS is even faster:
>> >
>> >     xts-speck64-neon:      Encryption 138.6 MB/s, Decryption 139.1 MB/s
>> >
>> > Note that as with the generic code, only the Speck128 and Speck64
>> > variants are supported.  Also, for now only the XTS mode of operation is
>> > supported, to target the disk and file encryption use cases.  The NEON
>> > code also only handles the portion of the data that is evenly divisible
>> > into 128-byte chunks, with any remainder handled by a C fallback.  Of
>> > course, other modes of operation could be added later if needed, and/or
>> > the NEON code could be updated to handle other buffer sizes.
>> >
>> > The XTS specification is only defined for AES which has a 128-bit block
>> > size, so for the GF(2^64) math needed for Speck64-XTS we use the
>> > reducing polynomial 'x^64 + x^4 + x^3 + x + 1' given by the original XEX
>> > paper.  Of course, when possible users should use Speck128-XTS, but even
>> > that may be too slow on some processors; Speck64-XTS can be faster.
>> >
>>
>> I think this is excellent work. Speck seems an appropriate solution to
>> this problem, and I'm glad we are not ending up with a stream cipher
>> for block encryption.
>>
>> Also, I think an arm64 port would be nice. I may take a stab at this
>> if nobody else beats me to it.
>
> We don't really want to encourage people to use Speck over AES with the
> Cryptography Extensions, so that's why I didn't include an arm64 port.  That
> being said, I suppose we can't stop people from adding an arm64 port if they
> really do prefer Speck, or maybe for use on arm64 CPUs that don't have the
> Cryptography Extensions (though I thought that almost all do).
>

Many do, but not all of them. A notable exception is the Raspberry Pi 3.

>>
>> I did run into an issue with this code though: On big-endian, I get
>>
>> [ 0.272381] alg: skcipher: Test 1 failed (invalid result) on
>> encryption for xts-speck64-neon
>> [    0.276151] 00000000: 84 af 54 07 19 d4 7c a6 9c 8a ac f6 c2 14 04 d8
>> [    0.278541] 00000010: 7f 18 6c 43 56 ed 0b b3 92 21 a2 d9 17 59 e4 3b
>>
>> so there may be a byte order corner case you missed in the rewrite (or
>> the issue existed before, as I did not test your v1)
>>
>
> To be honest I haven't tested either version on a big endian ARM CPU yet.  I
> don't really know how to do that currently; maybe it's possible with QEMU.
>

I tested this on a big-endian 32-bit VM running under KVM on a 64-bit host.

> But assuming I haven't missed anything, in the assembly code everything is
> treated as byte arrays with the exception of the round keys which are 32-bit or
> 64-bit numbers in CPU endianness.  The byte arrays are loaded and stored with
> vld1.8 and vst1.8 while the round keys are loaded with vld1.32 or vld1.64, so
> the assembly code *should* work correctly on a big endian CPU.
>

Indeed.

> However, looking over it now, I think there is a bug in the glue code for
> Speck64-XTS when it handles buffers not evenly divisible into 128 bytes.
> Namely, the tweak is treated as CPU endian when it should be little endian.
> Could you try the following patch?
>
> diff --git a/arch/arm/crypto/speck-neon-glue.c b/arch/arm/crypto/speck-neon-glue.c
> index 3987dd6e063e..960cc634b36f 100644
> --- a/arch/arm/crypto/speck-neon-glue.c
> +++ b/arch/arm/crypto/speck-neon-glue.c
> @@ -157,7 +157,7 @@ __speck64_xts_crypt(struct skcipher_request *req, speck64_crypt_one_t crypt_one,
>         struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
>         const struct speck64_xts_tfm_ctx *ctx = crypto_skcipher_ctx(tfm);
>         struct skcipher_walk walk;
> -       u64 tweak;
> +       __le64 tweak;
>         int err;
>
>         err = skcipher_walk_virt(&walk, req, true);
> @@ -184,16 +184,16 @@ __speck64_xts_crypt(struct skcipher_request *req, speck64_crypt_one_t crypt_one,
>                 }
>
>                 /* Handle any remainder with generic code */
> -               while (nbytes >= sizeof(u64)) {
> -                       *(u64 *)dst = *(u64 *)src ^ tweak;
> +               while (nbytes >= sizeof(__le64)) {
> +                       *(__le64 *)dst = *(__le64 *)src ^ tweak;
>                         (*crypt_one)(&ctx->main_key, dst, dst);
> -                       *(u64 *)dst ^= tweak;
> -                       tweak = (tweak << 1) ^
> -                               ((tweak & (1ULL << 63)) ? 0x1B : 0);
> -
> -                       dst += sizeof(u64);
> -                       src += sizeof(u64);
> -                       nbytes -= sizeof(u64);
> +                       *(__le64 *)dst ^= tweak;
> +                       tweak = cpu_to_le64((le64_to_cpu(tweak) << 1) ^
> +                                           ((tweak & cpu_to_le64(1ULL << 63)) ?
> +                                            0x1B : 0));
> +                       dst += sizeof(__le64);
> +                       src += sizeof(__le64);
> +                       nbytes -= sizeof(__le64);
>                 }
>                 err = skcipher_walk_done(&walk, nbytes);
>         }

This fixes it.

Tested-by: Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx>