On Tue, Jan 03, 2023 at 12:13:30PM +0100, Lukasz Stelmach wrote:
> > It also would be worth considering just optimizing crypto_xor() by
> > unrolling the word-at-a-time loop to 4x or so.
>
> If I understand correctly the generic 8regs and 32regs implementations
> in include/asm-generic/xor.h are what you mean. Using xor_blocks() in
> crypto_xor() could enable them for free on architectures lacking SIMD
> or vector instructions.

I actually meant exactly what I said -- unrolling the word-at-a-time
loop in crypto_xor().  Not using xor_blocks().  Something like this:

diff --git a/include/crypto/algapi.h b/include/crypto/algapi.h
index 61b327206b557..c0b90f14cae18 100644
--- a/include/crypto/algapi.h
+++ b/include/crypto/algapi.h
@@ -167,7 +167,18 @@ static inline void crypto_xor(u8 *dst, const u8 *src, unsigned int size)
 		unsigned long *s = (unsigned long *)src;
 		unsigned long l;
 
-		while (size > 0) {
+		while (size >= 4 * sizeof(unsigned long)) {
+			l = get_unaligned(d) ^ get_unaligned(s++);
+			put_unaligned(l, d++);
+			l = get_unaligned(d) ^ get_unaligned(s++);
+			put_unaligned(l, d++);
+			l = get_unaligned(d) ^ get_unaligned(s++);
+			put_unaligned(l, d++);
+			l = get_unaligned(d) ^ get_unaligned(s++);
+			put_unaligned(l, d++);
+			size -= 4 * sizeof(unsigned long);
+		}
+		if (size > 0) {
 			l = get_unaligned(d) ^ get_unaligned(s++);
 			put_unaligned(l, d++);
 			size -= sizeof(unsigned long);

Actually, the compiler might unroll the loop automatically anyway, so
the above change might not even be necessary.  The point is, I expect
that a proper scalar implementation will perform well for pretty much
anything other than large input sizes.  It's only large input sizes
where xor_blocks() might be worth it, considering the significant
overhead of the indirect call in xor_blocks() as well as entering an
SIMD code section.  (Note that indirect calls are very expensive these
days, due to the speculative execution mitigations.)
Of course, the real question is what real-world scenario are you
actually trying to optimize for...

- Eric