From: Linus Torvalds
> Sent: 21 July 2021 19:46
>
> On Wed, Jul 21, 2021 at 11:17 AM Nikolay Borisov <nborisov@xxxxxxxx> wrote:
> >
> > I find it somewhat arbitrary that we choose to align the 2nd pointer and
> > not the first.
>
> Yeah, that's a bit odd, but I don't think it matters.
>
> The hope is obviously that they are mutually aligned, and in that case
> it doesn't matter which one you aim to align.
>
> > So you are saying that the current memcmp could indeed use improvement
> > but you don't want it to be based on the glibc's code due to the ugly
> > misalignment handling?
>
> Yeah. I suspect that this (very simple) patch gives you the same
> performance improvement that the glibc code does.
>
> NOTE! I'm not saying this patch is perfect. This one doesn't even
> _try_ to do the mutual alignment, because it's really silly. But I'm
> throwing this out here for discussion, because
>
>  - it's really simple
>
>  - I suspect it gets you 99% of the way there
>
>  - the code generation is actually quite good with both gcc and clang.
> This is gcc:
>
>         memcmp:
>                 jmp     .L60
>         .L52:
>                 movq    (%rsi), %rax
>                 cmpq    %rax, (%rdi)
>                 jne     .L53
>                 addq    $8, %rdi
>                 addq    $8, %rsi
>                 subq    $8, %rdx
>         .L60:
>                 cmpq    $7, %rdx
>                 ja      .L52

I wonder how fast that can be made to run.
I think the two conditional branches have to run in separate clocks.
So you may get all 5 arithmetic operations to run in the same 2 clocks.
But that may be pushing things on everything except the very latest cpu.
The memory reads aren't limiting at all, the cpu can do two per clock.
So even though (IIRC) misaligned ones cost an extra clock it doesn't matter.

That looks like a *dst++ = *src++ loop.
The array copy dst[i] = src[i]; i++ requires one less 'addq'
provided the cpu has 'register + register' addressing.
Not decrementing the length also saves an 'addq'.
So the loop:
	for (i = 0; i < length - 7; i += 8)
		dst[i] = src[i];   /* Hacked to be right in C */
probably only has one addq and one cmpq per iteration.
That is much more likely to run in the 2 clocks.
(If you can persuade gcc not to transform it!)

It may also be possible to remove the cmpq by arranging
that the flags from the addq contain the right condition.
That needs something like:
	dst += len; src += len; len = -len;
	do
		dst[len] = src[len];
	while ((len += 8) < 0);
That probably isn't necessary for x86, but is likely to help sparc.

For a mips-like cpu (with 'compare and jump', only 'reg + constant' addressing)
you really want a loop like:
	dst_end = dst + length;
	do
		*dst++ = *src++;
	while (dst < dst_end);
This has two adds and a compare per iteration.
That might be a good compromise for aligned copies.

I'm not at all sure it is ever worth aligning either pointer
if misaligned reads don't fault.
Most compares (of any size) will be aligned.
So you get the 'hit' of the test when it cannot help.
That almost certainly exceeds any benefit.

	David
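
For anyone who wants to play with the negative-index form, here is a rough,
untested-on-real-hardware sketch of it applied to a compare rather than a copy.
The name memcmp_negidx is made up for illustration, the 8-byte loads go through
memcpy() only to keep the C well defined for misaligned pointers, and the
-1/0/1 return values are an assumption (memcmp only promises the sign).
Whether the compiler keeps the loop in this shape is a separate question.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not the kernel's memcmp. */
int memcmp_negidx(const void *a, const void *b, size_t len)
{
	const unsigned char *pa = a;
	const unsigned char *pb = b;
	size_t bulk = len & ~(size_t)7;	/* bytes covered by whole 8-byte words */
	ptrdiff_t tail = (ptrdiff_t)(len - bulk);
	ptrdiff_t i = -(ptrdiff_t)bulk;

	/* Point both at the end of the word-sized part; index counts up to 0. */
	pa += bulk;
	pb += bulk;

	if (i < 0) {
		do {
			uint64_t wa, wb;

			memcpy(&wa, pa + i, sizeof(wa));
			memcpy(&wb, pb + i, sizeof(wb));
			if (wa != wb)
				break;	/* byte loop below finds the exact byte */
		} while ((i += 8) < 0);
	}

	/* Covers the mismatching word (if any) plus the 0..7 trailing bytes. */
	for (; i < tail; i++) {
		if (pa[i] != pb[i])
			return pa[i] < pb[i] ? -1 : 1;
	}
	return 0;
}
```

The word loop only tests for equality, so the result does not depend on byte
order; the ordering is decided in the byte loop.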