On 15.07.21 г. 17:30, Matthew Wilcox wrote: > On Thu, Jul 15, 2021 at 05:13:09PM +0300, Nikolay Borisov wrote: >> Currently the comparison method vfs_dedupe_file_range_compare utilizes >> is a plain memcmp. This effectively means the code is doing byte-by-byte >> comparison. Instead, the code could do word-sized comparison without >> adverse effect on performance, provided that the comparison's length is >> at least as big as the native word size, as well as resulting memory >> addresses are properly aligned. > > Sounds to me like somebody hasn't optimised memcmp() very well ... > is this x86-64? > That was my first impression, here's the profile: │ Disassembly of section .text: │ │ ffffffff815c6f60 <memcmp>: │ memcmp(): │ test %rdx,%rdx │ ↓ je 22 │ xor %ecx,%ecx │ ↓ jmp 12 49.32 │ 9: add $0x1,%rcx 0.03 │ cmp %rcx,%rdx 11.82 │ ↓ je 21 0.01 │12: movzbl (%rdi,%rcx,1),%eax 38.19 │ movzbl (%rsi,%rcx,1),%r8d 0.59 │ sub %r8d,%eax 0.04 │ ↑ je 9 │ ← retq │21: ← retq │22: xor %eax,%eax │ ← retq It's indeed on x86-64 and according to the sources it's using __builtin_memcmp according to arch/x86/boot/string.h >> @@ -256,9 +257,35 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, >> flush_dcache_page(src_page); >> flush_dcache_page(dest_page); >> >> - if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len)) >> - same = false; >> >> + if (!IS_ALIGNED((unsigned long)(src_addr + src_poff), block_size) || >> + !IS_ALIGNED((unsigned long)(dest_addr + dest_poff), block_size) || >> + cmp_len < block_size) { > > Can this even happen? Surely we can only dedup on a block boundary and > blocks are required to be a power of two and at least 512 bytes in size? I was wondering the same thing, but AFAICS it seems to be possible i.e if userspace spaces bad offsets, while all kinds of internal fs synchronization ops are going to be performed on aligned offsets, that doesn't mean the original ones, passed from userspace are themselves aligned explicitly. > >> + if (memcmp(src_addr + src_poff, dest_addr + dest_poff, >> + cmp_len)) >> + same = false; >> + } else { >> + int i; >> + size_t blocks = cmp_len / block_size; >> + loff_t rem_len = cmp_len - (blocks * block_size); >> + unsigned long *src = src_addr + src_poff; >> + unsigned long *dst = dest_addr + src_poff; >> + >> + for (i = 0; i < blocks; i++) { >> + if (src[i] - dst[i]) { >> + same = false; >> + goto finished; >> + } >> + } >> + >> + if (rem_len) { >> + src_addr += src_poff + (blocks * block_size); >> + dest_addr += dest_poff + (blocks * block_size); >> + if (memcmp(src_addr, dest_addr, rem_len)) >> + same = false; >> + } >> + } >> +finished: >> kunmap_atomic(dest_addr); >> kunmap_atomic(src_addr); >> unlock: >> -- >> 2.25.1 >> >