On Tue, Jan 03, 2023 at 12:13:30PM +0100, Lukasz Stelmach wrote:
> > It also would be worth considering just optimizing crypto_xor() by
> > unrolling the word-at-a-time loop to 4x or so.
>
> If I understand correctly the generic 8regs and 32regs implementations
> in include/asm-generic/xor.h are what you mean. Using xor_blocks() in
> crypto_xor() could enable them for free on architectures lacking SIMD
> or vector instructions.

I actually meant exactly what I said -- unrolling the word-at-a-time
loop in crypto_xor().  Not using xor_blocks().  Something like this:

diff --git a/include/crypto/algapi.h b/include/crypto/algapi.h
index 61b327206b557..c0b90f14cae18 100644
--- a/include/crypto/algapi.h
+++ b/include/crypto/algapi.h
@@ -167,7 +167,18 @@ static inline void crypto_xor(u8 *dst, const u8 *src, unsigned int size)
 		unsigned long *s = (unsigned long *)src;
 		unsigned long l;
 
-		while (size > 0) {
+		while (size >= 4 * sizeof(unsigned long)) {
+			l = get_unaligned(d) ^ get_unaligned(s++);
+			put_unaligned(l, d++);
+			l = get_unaligned(d) ^ get_unaligned(s++);
+			put_unaligned(l, d++);
+			l = get_unaligned(d) ^ get_unaligned(s++);
+			put_unaligned(l, d++);
+			l = get_unaligned(d) ^ get_unaligned(s++);
+			put_unaligned(l, d++);
+			size -= 4 * sizeof(unsigned long);
+		}
+		if (size > 0) {
 			l = get_unaligned(d) ^ get_unaligned(s++);
 			put_unaligned(l, d++);
 			size -= sizeof(unsigned long);

Actually, the compiler might unroll the loop automatically anyway, so
the above change might not even be necessary.  The point is, I expect
that a proper scalar implementation will perform well for pretty much
anything other than large input sizes.  It's only large input sizes
where xor_blocks() might be worth it, considering the significant
overhead of the indirect call in xor_blocks() as well as entering an
SIMD code section.  (Note that indirect calls are very expensive these
days, due to the speculative execution mitigations.)
Of course, the real question is what real-world scenario are you
actually trying to optimize for...

- Eric