On 13 February 2018 at 18:57, Eric Biggers <ebiggers@xxxxxxxxxx> wrote: > Hi Ard, > > On Tue, Feb 13, 2018 at 11:34:36AM +0000, Ard Biesheuvel wrote: >> Hi Eric, >> >> On 12 February 2018 at 23:52, Eric Biggers <ebiggers@xxxxxxxxxx> wrote: >> > Add an ARM NEON-accelerated implementation of Speck-XTS. It operates on >> > 128-byte chunks at a time, i.e. 8 blocks for Speck128 or 16 blocks for >> > Speck64. Each 128-byte chunk goes through XTS preprocessing, then is >> > encrypted/decrypted (doing one cipher round for all the blocks, then the >> > next round, etc.), then goes through XTS postprocessing. >> > >> > The performance depends on the processor but can be about 3 times faster >> > than the generic code. For example, on an ARMv7 processor we observe >> > the following performance with Speck128/256-XTS: >> > >> > xts-speck128-neon: Encryption 107.9 MB/s, Decryption 108.1 MB/s >> > xts(speck128-generic): Encryption 32.1 MB/s, Decryption 36.6 MB/s >> > >> > In comparison to AES-256-XTS without the Cryptography Extensions: >> > >> > xts-aes-neonbs: Encryption 41.2 MB/s, Decryption 36.7 MB/s >> > xts(aes-asm): Encryption 31.7 MB/s, Decryption 30.8 MB/s >> > xts(aes-generic): Encryption 21.2 MB/s, Decryption 20.9 MB/s >> > >> > Speck64/128-XTS is even faster: >> > >> > xts-speck64-neon: Encryption 138.6 MB/s, Decryption 139.1 MB/s >> > >> > Note that as with the generic code, only the Speck128 and Speck64 >> > variants are supported. Also, for now only the XTS mode of operation is >> > supported, to target the disk and file encryption use cases. The NEON >> > code also only handles the portion of the data that is evenly divisible >> > into 128-byte chunks, with any remainder handled by a C fallback. Of >> > course, other modes of operation could be added later if needed, and/or >> > the NEON code could be updated to handle other buffer sizes. >> > >> > The XTS specification is only defined for AES which has a 128-bit block >> > size, so for the GF(2^64) math needed for Speck64-XTS we use the >> > reducing polynomial 'x^64 + x^4 + x^3 + x + 1' given by the original XEX >> > paper. Of course, when possible users should use Speck128-XTS, but even >> > that may be too slow on some processors; Speck64-XTS can be faster. >> > >> >> I think this is excellent work. Speck seems an appropriate solution to >> this problem, and I'm glad we are not ending up with a stream cipher >> for block encryption. >> >> Also, I think an arm64 port would be nice. I may take a stab at this >> if nobody else beats me to it. > > We don't really want to encourage people to use Speck over AES with the > Cryptography Extensions, so that's why I didn't include an arm64 port. That > being said, I suppose we can't stop people from adding an arm64 port if they > really do prefer Speck, or maybe for use on arm64 CPUs that don't have the > Cryptography Extensions (though I thought that almost all do). > Many do, but not all of them. A notable exception is the Raspberry Pi 3. >> >> I did run into an issue with this code though: On big-endian, I get >> >> [ 0.272381] alg: skcipher: Test 1 failed (invalid result) on >> encryption for xts-speck64-neon >> [ 0.276151] 00000000: 84 af 54 07 19 d4 7c a6 9c 8a ac f6 c2 14 04 d8 >> [ 0.278541] 00000010: 7f 18 6c 43 56 ed 0b b3 92 21 a2 d9 17 59 e4 3b >> >> so there may be a byte order corner case you missed in the rewrite (or >> the issue existed before, as I did not test your v1) >> > > To be honest I haven't tested either version on a big endian ARM CPU yet. I > don't really know how to do that currently; maybe it's possible with QEMU. > I tested this on a big-endian 32-bit VM running under KVM on a 64-bit host. > But assuming I haven't missed anything, in the assembly code everything is > treated as byte arrays with the exception of the round keys which are 32-bit or > 64-bit numbers in CPU endianness. The byte arrays are loaded and stored with > vld1.8 and vst1.8 while the round keys are loaded with vld1.32 or vld1.64, so > the assembly code *should* work correctly on a big endian CPU. > Indeed. > However, looking over it now, I think there is a bug in the glue code for > Speck64-XTS when it handles buffers not evenly divisible into 128 bytes. > Namely, the tweak is treated as CPU endian when it should be little endian. > Could you try the following patch? > > diff --git a/arch/arm/crypto/speck-neon-glue.c b/arch/arm/crypto/speck-neon-glue.c > index 3987dd6e063e..960cc634b36f 100644 > --- a/arch/arm/crypto/speck-neon-glue.c > +++ b/arch/arm/crypto/speck-neon-glue.c > @@ -157,7 +157,7 @@ __speck64_xts_crypt(struct skcipher_request *req, speck64_crypt_one_t crypt_one, > struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req); > const struct speck64_xts_tfm_ctx *ctx = crypto_skcipher_ctx(tfm); > struct skcipher_walk walk; > - u64 tweak; > + __le64 tweak; > int err; > > err = skcipher_walk_virt(&walk, req, true); > @@ -184,16 +184,16 @@ __speck64_xts_crypt(struct skcipher_request *req, speck64_crypt_one_t crypt_one, > } > > /* Handle any remainder with generic code */ > - while (nbytes >= sizeof(u64)) { > - *(u64 *)dst = *(u64 *)src ^ tweak; > + while (nbytes >= sizeof(__le64)) { > + *(__le64 *)dst = *(__le64 *)src ^ tweak; > (*crypt_one)(&ctx->main_key, dst, dst); > - *(u64 *)dst ^= tweak; > - tweak = (tweak << 1) ^ > - ((tweak & (1ULL << 63)) ? 0x1B : 0); > - > - dst += sizeof(u64); > - src += sizeof(u64); > - nbytes -= sizeof(u64); > + *(__le64 *)dst ^= tweak; > + tweak = cpu_to_le64((le64_to_cpu(tweak) << 1) ^ > + ((tweak & cpu_to_le64(1ULL << 63)) ? > + 0x1B : 0)); > + dst += sizeof(__le64); > + src += sizeof(__le64); > + nbytes -= sizeof(__le64); > } > err = skcipher_walk_done(&walk, nbytes); > } This fixes it. Tested-by: Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx>