On Wed, Jul 21, 2021 at 11:17 AM Nikolay Borisov <nborisov@xxxxxxxx> wrote:
>
> I find it somewhat arbitrary that we choose to align the 2nd pointer and
> not the first.

Yeah, that's a bit odd, but I don't think it matters. The hope is
obviously that they are mutually aligned, and in that case it doesn't
matter which one you aim to align.

> So you are saying that the current memcmp could indeed use improvement
> but you don't want it to be based on the glibc's code due to the ugly
> misalignment handling?

Yeah. I suspect that this (very simple) patch gives you the same
performance improvement that the glibc code does.

NOTE! I'm not saying this patch is perfect. This one doesn't even
_try_ to do the mutual alignment, because it's really silly. But I'm
throwing this out here for discussion, because

 - it's really simple

 - I suspect it gets you 99% of the way there

 - the code generation is actually quite good with both gcc and clang.

This is gcc:

  memcmp:
        jmp     .L60
  .L52:
        movq    (%rsi), %rax
        cmpq    %rax, (%rdi)
        jne     .L53
        addq    $8, %rdi
        addq    $8, %rsi
        subq    $8, %rdx
  .L60:
        cmpq    $7, %rdx
        ja      .L52
        testq   %rdx, %rdx
        je      .L61
  .L53:
        xorl    %ecx, %ecx
        jmp     .L56
  .L62:
        addq    $1, %rcx
        cmpq    %rcx, %rdx
        je      .L51
  .L56:
        movzbl  (%rdi,%rcx), %eax
        movzbl  (%rsi,%rcx), %r8d
        subl    %r8d, %eax
        je      .L62
  .L51:
        ret
  .L61:
        xorl    %eax, %eax
        ret

and notice how there are no spills, no extra garbage, just simple and
straightforward code. Those things end up mattering too - it's good for
I$, it's good for the small cases, and it's good for debugging and
reading the code.

If this is "good enough" for your test case, I really would prefer
something like this. "Make it as simple as possible, but no simpler."

I can do the mutual alignment too, but I'd actually prefer to do it as
a separate patch, for when there are numbers for that. And I wouldn't
do it as a byte-by-byte case, because that's just stupid.
I'd do it using a separate first single "get unaligned word from both
sources, compare them for equality, and then only add enough bytes to
align" step.

                Linus
 lib/string.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/lib/string.c b/lib/string.c
index 77bd0b1d3296..b2de45a581f4 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -29,6 +29,7 @@
 #include <linux/errno.h>
 #include <linux/slab.h>
+#include <asm/unaligned.h>
 #include <asm/byteorder.h>
 #include <asm/word-at-a-time.h>
 #include <asm/page.h>
@@ -935,6 +936,21 @@ __visible int memcmp(const void *cs, const void *ct, size_t count)
 	const unsigned char *su1, *su2;
 	int res = 0;
 
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+	if (count >= sizeof(unsigned long)) {
+		const unsigned long *u1 = cs;
+		const unsigned long *u2 = ct;
+
+		do {
+			if (get_unaligned(u1) != get_unaligned(u2))
+				break;
+			u1++;
+			u2++;
+			count -= sizeof(unsigned long);
+		} while (count >= sizeof(unsigned long));
+		cs = u1;
+		ct = u2;
+	}
+#endif
 	for (su1 = cs, su2 = ct; 0 < count; ++su1, ++su2, count--)
 		if ((res = *su1 - *su2) != 0)
 			break;