On 15.07.21 г. 18:09, Matthew Wilcox wrote: > On Thu, Jul 15, 2021 at 05:44:15PM +0300, Nikolay Borisov wrote: >> That was my first impression, here's the profile: >> >> │ Disassembly of section .text: >> │ >> │ ffffffff815c6f60 <memcmp>: >> │ memcmp(): >> │ test %rdx,%rdx >> │ ↓ je 22 >> │ xor %ecx,%ecx >> │ ↓ jmp 12 >> 49.32 │ 9: add $0x1,%rcx >> 0.03 │ cmp %rcx,%rdx >> 11.82 │ ↓ je 21 >> 0.01 │12: movzbl (%rdi,%rcx,1),%eax >> 38.19 │ movzbl (%rsi,%rcx,1),%r8d >> 0.59 │ sub %r8d,%eax >> 0.04 │ ↑ je 9 > > That looks like a byte loop to me ... > >> It's indeed on x86-64 and according to the sources it's using >> __builtin_memcmp according to arch/x86/boot/string.h > > I think the 'boot' part of that path might indicate that it's not what's > actually being used by the kernel. > > $ git grep __HAVE_ARCH_MEMCMP > arch/arc/include/asm/string.h:#define __HAVE_ARCH_MEMCMP > arch/arm64/include/asm/string.h:#define __HAVE_ARCH_MEMCMP > arch/csky/abiv2/inc/abi/string.h:#define __HAVE_ARCH_MEMCMP > arch/powerpc/include/asm/string.h:#define __HAVE_ARCH_MEMCMP > arch/s390/include/asm/string.h:#define __HAVE_ARCH_MEMCMP /* arch function */ > arch/s390/lib/string.c:#ifdef __HAVE_ARCH_MEMCMP > arch/s390/purgatory/string.c:#define __HAVE_ARCH_MEMCMP /* arch function */ > arch/sparc/include/asm/string.h:#define __HAVE_ARCH_MEMCMP > include/linux/string.h:#ifndef __HAVE_ARCH_MEMCMP > lib/string.c:#ifndef __HAVE_ARCH_MEMCMP > > So I think x86-64 is using the stupid one. > >>> Can this even happen? Surely we can only dedup on a block boundary and >>> blocks are required to be a power of two and at least 512 bytes in size? >> >> I was wondering the same thing, but AFAICS it seems to be possible i.e >> if userspace spaces bad offsets, while all kinds of internal fs >> synchronization ops are going to be performed on aligned offsets, that >> doesn't mean the original ones, passed from userspace are themselves >> aligned explicitly. > > Ah, I thought it'd be failed before we got to this point. > > But honestly, I think x86-64 needs to be fixed to either use > __builtin_memcmp() or to have a nicely written custom memcmp(). I > tried to find the gcc implementation of __builtin_memcmp() on > x86-64, but I can't. __builtin_memcmp is a no go due to this function being an ifunc [0] and is being resolved to a call to memcmp which causes link => build failures. So what remains is to either patch particular sites or as Dave suggested have a generic optimized implementation. glibc's implementation [1] seems straightforward enough to be convertible to kernel style. However it would need definitive proof it actually improves performance in a variety of scenarios. [0] https://sourceware.org/glibc/wiki/GNU_IFUNC [1] https://sourceware.org/git/?p=glibc.git;a=blob;f=string/memcmp.c >