On Thu, Jul 22, 2021 at 4:28 AM Nikolay Borisov <n.borisov.lkml@xxxxxxxxx> wrote:
>
> This one also works, tested only on x86-64. Looking at the perf diff:
>
>   30.44%  -28.66%  [kernel.vmlinux]  [k] memcmp

Ok. So the one that doesn't even bother to align is

  30.44%  -29.38%  [kernel.vmlinux]  [k] memcmp

and the one that aligns the first one is

  30.44%  -28.66%  [kernel.vmlinux]  [k] memcmp

and the difference between the two is basically in the noise:

   1.05%   +0.72%  [kernel.vmlinux]  [k] memcmp

but the first one does seem to be slightly better.

> Now on a more practical note, IIUC your 2nd version makes sense if the
> cost of doing a one unaligned access in the loop body is offset by the
> fact we are doing a native word-sized comparison, right?

So honestly, the reason the first one seems to beat the second one is
that the cost of unaligned accesses on modern x86 is basically
epsilon.

For all normal unaligned accesses there simply is no cost at all.

There is a _tiny_ cost when the unaligned access crosses a cacheline
access boundary (which on older CPUs is every 32 bytes, on modern ones
it's 64 bytes). And then there is another slightly bigger cost when
the unaligned access actually crosses a page boundary.

But even those non-zero cost cases are basically in the noise, because
under most circumstances they will be hidden by any out-of-order
engine, and completely dwarfed by the _real_ costs, which are branch
mispredicts and cache misses.

So on the whole, unaligned accesses are basically no cost at all. You
almost have to have unusual code sequences for them to be even
measurable.

So that second patch that aligns one of the sources is basically only
extra overhead for no real advantage. The cost of it is probably one
more branch mispredict, and possibly a cycle or two for the extra
instructions.

Which is why the first one wins: it's simpler, and the extra work the
second one does is basically not worth it on x86.
Plus I suspect your test-case was all aligned anyway to begin with, so
the extra work is _doubly_ pointless.

I suspect the second patch would be worthwhile if

 (a) there really were a lot of strings that weren't aligned
     (likelihood: low)

 (b) other microarchitectures that do worse on unaligned accesses -
     some microarchitectures spend extra cycles on _any_ unaligned
     accesses even if they don't cross cache access boundaries etc.

and I can see (b) happening quite easily. You just won't see it on a
modern x86-64 setup.

I suspect we should start with the first version. It's not only better
on x86, but it's simpler, and it's guarded by that

  #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS

so it's fundamentally "safer" on architectures that are just horrible
about unaligned accesses.

Now, admittedly I don't personally even care about such architectures,
and because we use "get_unaligned()", the compiler should always
generate code that doesn't absolutely suck for bad architectures, but
considering how long we've gone with the completely brainlessly simple
"byte at a time" version without anybody even noticing, I think a
minimal change is a better change.

That said, I'm not convinced I want to apply even that minimal first
patch outside of the merge window. So would you mind reminding me
about this patch the next merge window? Unless there was some big
external reason why the performance of memcmp() mattered to you so
much (ie some user that actually noticed and complained) and we really
want to prioritize this..

              Linus