On Mon, Jan 30, 2017 at 02:11:29PM +0000, Ard Biesheuvel wrote:
> Instead of unconditionally forcing 4 byte alignment for all generic
> chaining modes that rely on crypto_xor() or crypto_inc() (which may
> result in unnecessary copying of data when the underlying hardware
> can perform unaligned accesses efficiently), make those functions
> deal with unaligned input explicitly, but only if the Kconfig symbol
> HAVE_EFFICIENT_UNALIGNED_ACCESS is set. This will allow us to drop
> the alignmasks from the CBC, CMAC, CTR, CTS, PCBC and SEQIV drivers.
>
> For crypto_inc(), this simply involves making the 4-byte stride
> conditional on HAVE_EFFICIENT_UNALIGNED_ACCESS being set, given that
> it typically operates on 16 byte buffers.
>
> For crypto_xor(), an algorithm is implemented that simply runs through
> the input using the largest strides possible if unaligned accesses are
> allowed. If they are not, an optimal sequence of memory accesses is
> emitted that takes the relative alignment of the input buffers into
> account, e.g., if the relative misalignment of dst and src is 4 bytes,
> the entire xor operation will be completed using 4 byte loads and stores
> (modulo unaligned bits at the start and end). Note that all expressions
> involving startalign and misalign are simply eliminated by the compiler
> if HAVE_EFFICIENT_UNALIGNED_ACCESS is defined.
>

Hi Ard,

This is a good idea; I think requiring 4-byte alignment unconditionally
was error-prone, and also inefficient on many architectures.

The new crypto_inc() looks fine, but the new crypto_xor() is quite
complicated.  I'm wondering whether it has to be that way, especially
since it seems to most commonly be used on very small input buffers,
e.g. 8 or 16-byte blocks.

There are a couple of trivial ways it could be simplified, e.g. using
'dst' and 'src' directly instead of 'a' and 'b' (which also seems to
improve code generation by getting rid of the '+= len & ~mask' parts),
or using sizeof(long) directly instead of 'size' and 'mask'.

But also, when I tried testing the proposed crypto_xor() on MIPS, it
didn't work correctly on a misaligned buffer.  With startalign=1, it did
one iteration of the following loop and then exited with startalign=0
and entered the "unsigned long at a time" loop, which is incorrect since
at that point the buffers were not yet fully aligned:

>        do {
>                if (len < sizeof(u8))
>                        break;
>
>                if (len >= size && !(startalign & 1) && !(misalign & 1))
>                        break;
>
>                *dst++ ^= *src++;
>                len -= sizeof(u8);
>                startalign &= ~sizeof(u8);
>        } while (misalign & 1);

I think it would need to do this instead:

        startalign += sizeof(u8);
        startalign %= sizeof(unsigned long);

But I am wondering whether you considered something simpler, using the
get_unaligned/put_unaligned helpers, maybe even using a switch statement
for the last (sizeof(long) - 1) bytes so it can be compiled as a jump
table.
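To be concrete about the fix, here is the byte loop quoted above with
that startalign change dropped in.  This is only a sketch: it assumes
'size', 'startalign', and 'misalign', and the rest of the function, keep
exactly the meaning they have in your patch, and I haven't tested this
variant:

        do {
                if (len < sizeof(u8))
                        break;

                if (len >= size && !(startalign & 1) && !(misalign & 1))
                        break;

                *dst++ ^= *src++;
                len -= sizeof(u8);
                startalign += sizeof(u8);            /* advance the alignment offset */
                startalign %= sizeof(unsigned long); /* ... modulo the word size */
        } while (misalign & 1);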
The simpler version could look something like this:

#define xor_unaligned(dst, src) \
        put_unaligned(get_unaligned(dst) ^ get_unaligned(src), (dst))

void crypto_xor(u8 *dst, const u8 *src, unsigned int len)
{
        while (len >= sizeof(unsigned long)) {
                xor_unaligned((unsigned long *)dst, (unsigned long *)src);
                dst += sizeof(unsigned long);
                src += sizeof(unsigned long);
                len -= sizeof(unsigned long);
        }

        switch (len) {
#ifdef CONFIG_64BIT
        case 7:
                dst[6] ^= src[6];
                /* fall through */
        case 6:
                xor_unaligned((u16 *)&dst[4], (u16 *)&src[4]);
                goto len_4;
        case 5:
                dst[4] ^= src[4];
                /* fall through */
        case 4:
len_4:
                xor_unaligned((u32 *)dst, (u32 *)src);
                break;
#endif
        case 3:
                dst[2] ^= src[2];
                /* fall through */
        case 2:
                xor_unaligned((u16 *)dst, (u16 *)src);
                break;
        case 1:
                dst[0] ^= src[0];
                break;
        }
}

That would seem like a better choice for small buffers, which seems to
be the more common case.  It should generate slightly faster code on
architectures with fast unaligned access like x86_64, while still being
sufficient on architectures without (perhaps even faster, since it
wouldn't have as many branches).

Eric
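P.S. For a bit of context on the "small buffers" point: the heaviest
users are the block cipher modes, which call crypto_xor() once per block
with len equal to the block size.  E.g. the CBC encrypt path does
something like the following for each block (paraphrased from memory,
not an exact quote of the source):

        crypto_xor(iv, src, bsize);  /* bsize is 8 or 16 for common ciphers */

So with the version above, a 16-byte block is just two unaligned
unsigned long xors on 64-bit, and the switch is never entered.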