On Wed, Sep 5, 2018 at 7:32 AM, Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
> Note that if ever needed there's also still room for optimizing the GF(2^128)
> multiplications further, e.g. multiplying by 'x' and 'x^2' in parallel, or maybe
> having a version specialized for 32-bit processors.

Given that this is used to encrypt small buffers only, skipping ahead
seems like it may also be a viable strategy. For example, for the XTS
polynomial x^128 + x^7 + x^2 + x + 1 one can multiply by x^64 very
efficiently with

    u128 skip64(u128 x)
    {
        u128 b64 = (x >> 64);                   /* high half * 1   */
        u128 b63 = (x >> 63) & ~(u128)0x01;     /* high half * x   */
        u128 b62 = (x >> 62) & ~(u128)0x03;     /* high half * x^2 */
        u128 b57 = (x >> 57) & ~(u128)0x7f;     /* high half * x^7 */

        return (x << 64) ^ (b64 ^ b63 ^ b62 ^ b57);
    }

Calling this twice multiplies by x^128, i.e. it skips exactly 128 blocks
(128 * 16 = 2048 bytes), in which case we can xor both halves of a
4096-byte sector in parallel.
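
A quick way to sanity-check skip64() is to compare it against 64 single
multiplications by x. The sketch below is only that, a sketch: it assumes a
compiler that provides unsigned __int128 (GCC or Clang), it treats bit i of the
value as the coefficient of x^i, and mul_x(), mul_x_n() and skip64_selftest()
are names made up here for illustration, not existing kernel helpers.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;	/* assumption: GCC/Clang extension */

/* Multiply by x modulo x^128 + x^7 + x^2 + x + 1. */
static u128 mul_x(u128 t)
{
	u128 carry = t >> 127;	/* coefficient of x^127, about to overflow */

	/* x^128 reduces to x^7 + x^2 + x + 1, i.e. 0x87 */
	return (t << 1) ^ (carry * 0x87);
}

/* Reference implementation: n single steps. */
static u128 mul_x_n(u128 t, unsigned int n)
{
	while (n--)
		t = mul_x(t);
	return t;
}

/* Multiply by x^64 in one step, same logic as the function above. */
static u128 skip64(u128 x)
{
	u128 b64 = (x >> 64);
	u128 b63 = (x >> 63) & ~(u128)0x01;
	u128 b62 = (x >> 62) & ~(u128)0x03;
	u128 b57 = (x >> 57) & ~(u128)0x7f;

	return (x << 64) ^ (b64 ^ b63 ^ b62 ^ b57);
}

/* Hypothetical self-test: the fast path must match the slow path. */
static void skip64_selftest(void)
{
	/* arbitrary test pattern with bits set in both halves */
	u128 v = ((u128)0x0123456789abcdefULL << 64) | 0xfedcba9876543210ULL;

	assert(skip64(v) == mul_x_n(v, 64));
	/* two calls multiply by x^128, i.e. skip 128 blocks */
	assert(skip64(skip64(v)) == mul_x_n(v, 128));
}
```

The second assertion is the case that matters for a 4096-byte sector: the tweak
for block 128 is derived from the tweak for block 0 with two calls instead of
128 single steps.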