On Thu, Jul 22, 2021 at 4:28 AM Nikolay Borisov <n.borisov.lkml@xxxxxxxxx> wrote:
>
> This one also works, tested only on x86-64. Looking at the perf diff:
>
>   30.44%  -28.66%  [kernel.vmlinux]  [k] memcmp

Ok. So the one that doesn't even bother to align is

  30.44%  -29.38%  [kernel.vmlinux]  [k] memcmp

and the one that aligns the first one is

  30.44%  -28.66%  [kernel.vmlinux]  [k] memcmp

and the difference between the two is basically in the noise:

   1.05%   +0.72%  [kernel.vmlinux]  [k] memcmp

but the first one does seem to be slightly better.

> Now on a more practical note, IIUC your 2nd version makes sense if the
> cost of doing a one unaligned access in the loop body is offset by the
> fact we are doing a native word-sized comparison, right?

So honestly, the reason the first one seems to beat the second one is
that the cost of unaligned accesses on modern x86 is basically
epsilon.

For all normal unaligned accesses there simply is no cost at all.

There is a _tiny_ cost when the unaligned access crosses a cacheline
access boundary (which on older CPUs is every 32 bytes, on modern ones
it's 64 bytes). And then there is another slightly bigger cost when
the unaligned access actually crosses a page boundary.

But even those non-zero cost cases are basically in the noise, because
under most circumstances they will be hidden by any out-of-order
engine, and completely dwarfed by the _real_ costs, which are branch
mispredicts and cache misses.

So on the whole, unaligned accesses are basically no cost at all. You
almost have to have unusual code sequences for them to be even
measurable.

So that second patch that aligns one of the sources is basically only
extra overhead for no real advantage. The cost of it is probably one
more branch mispredict, and possibly a cycle or two for the extra
instructions.

Which is why the first one wins: it's simpler, and the extra work the
second one does is basically not worth it on x86.
Plus I suspect your test-case was all aligned anyway to begin with, so
the extra work is _doubly_ pointless.

I suspect the second patch would be worthwhile if

 (a) there really were a lot of strings that weren't aligned
     (likelihood: low)

 (b) other microarchitectures that do worse on unaligned accesses -
     some microarchitectures spend extra cycles on _any_ unaligned
     accesses even if they don't cross cache access boundaries etc.

and I can see (b) happening quite easily. You just won't see it on a
modern x86-64 setup.

I suspect we should start with the first version. It's not only better
on x86, but it's simpler, and it's guarded by that

  #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS

so it's fundamentally "safer" on architectures that are just horrible
about unaligned accesses.

Now, admittedly I don't personally even care about such architectures,
and because we use "get_unaligned()", the compiler should always
generate code that doesn't absolutely suck for bad architectures, but
considering how long we've gone with the completely brainlessly simple
"byte at a time" version without anybody even noticing, I think a
minimal change is a better change.

That said, I'm not convinced I want to apply even that minimal first
patch outside of the merge window. So would you mind reminding me
about this patch the next merge window? Unless there was some big
external reason why the performance of memcmp() mattered to you so
much (ie some user that actually noticed and complained) and we really
want to prioritize this..

              Linus