On Wed, Sep 19, 2018 at 3:08 AM Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
> Does this consistently perform as well as an implementation that organizes the
> operations such that the quarterrounds for all columns/diagonals are
> interleaved? As-is, there are tight dependencies in QUARTER_ROUND() (as well as
> in the existing chacha20_block() in lib/chacha20.c, for that matter), so we're
> heavily depending on the compiler to do the needed interleaving so as to not get
> potentially disastrous performance. Making it explicit could be a good idea.

It does perform as well, and the compiler outputs good code, even on
older compilers. Notably that's all a single statement (via the comma
operator).

> > +}
> > +
> > +static void chacha20_generic(u8 *out, const u8 *in, u32 len, const u32 key[8],
> > +			     const u32 counter[4])
> > +{
> > +	__le32 buf[CHACHA20_BLOCK_WORDS];
> > +	u32 x[] = {
> > +		EXPAND_32_BYTE_K,
> > +		key[0], key[1], key[2], key[3],
> > +		key[4], key[5], key[6], key[7],
> > +		counter[0], counter[1], counter[2], counter[3]
> > +	};
> > +
> > +	if (out != in)
> > +		memmove(out, in, len);
> > +
> > +	while (len >= CHACHA20_BLOCK_SIZE) {
> > +		chacha20_block_generic(buf, x);
> > +		crypto_xor(out, (u8 *)buf, CHACHA20_BLOCK_SIZE);
> > +		len -= CHACHA20_BLOCK_SIZE;
> > +		out += CHACHA20_BLOCK_SIZE;
> > +	}
> > +	if (len) {
> > +		chacha20_block_generic(buf, x);
> > +		crypto_xor(out, (u8 *)buf, len);
> > +	}
> > +}
>
> If crypto_xor_cpy() is used instead of crypto_xor(), and 'in' is incremented
> along with 'out', then the memmove() is not needed.

Nice idea, thanks. Implemented.

Jason
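
For illustration only, here is a rough userspace sketch of the loop shape
suggested above: XOR from 'in' into 'out' in one pass and advance both
pointers, so no up-front memmove() is needed. It is not the code that went
into the patch. xor_cpy() is a hypothetical stand-in for the kernel's
crypto_xor_cpy(), and toy_block() is a made-up placeholder keystream
generator, not ChaCha20.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 64

/* Hypothetical stand-in for crypto_xor_cpy(): dst[i] = src1[i] ^ src2[i]. */
static void xor_cpy(uint8_t *dst, const uint8_t *src1, const uint8_t *src2,
		    size_t n)
{
	for (size_t i = 0; i < n; i++)
		dst[i] = src1[i] ^ src2[i];
}

/*
 * Placeholder keystream block (assumption, NOT ChaCha20): derives 64 bytes
 * from a counter and then increments the counter.
 */
static void toy_block(uint8_t buf[BLOCK_SIZE], uint32_t *counter)
{
	for (size_t i = 0; i < BLOCK_SIZE; i++)
		buf[i] = (uint8_t)(*counter * 31u + i);
	(*counter)++;
}

/* XOR 'len' bytes of 'in' with the keystream into 'out', with no memmove(). */
static void stream_xor(uint8_t *out, const uint8_t *in, size_t len,
		       uint32_t counter)
{
	uint8_t buf[BLOCK_SIZE];

	/* Full blocks: both pointers advance together. */
	while (len >= BLOCK_SIZE) {
		toy_block(buf, &counter);
		xor_cpy(out, in, buf, BLOCK_SIZE);
		len -= BLOCK_SIZE;
		out += BLOCK_SIZE;
		in += BLOCK_SIZE;
	}
	/* Trailing partial block, if any. */
	if (len) {
		toy_block(buf, &counter);
		xor_cpy(out, in, buf, len);
	}
}
```

In this sketch, out == in (in-place operation) still works because each output
byte depends only on the input byte at the same index. Partially overlapping
buffers are not handled, which the memmove() variant would have tolerated.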