On Thu, Jul 15, 2021 at 05:13:09PM +0300, Nikolay Borisov wrote:
> Currently the comparison method vfs_dedupe_file_range_compare utilizes
> is a plain memcmp. This effectively means the code is doing byte-by-byte
> comparison. Instead, the code could do word-sized comparison without
> adverse effect on performance, provided that the comparison's length is
> at least as big as the native word size, as well as resulting memory
> addresses are properly aligned.

Sounds to me like somebody hasn't optimised memcmp() very well ... is
this x86-64?

> @@ -256,9 +257,35 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
>  		flush_dcache_page(src_page);
>  		flush_dcache_page(dest_page);
>  
> -		if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
> -			same = false;
>  
> +		if (!IS_ALIGNED((unsigned long)(src_addr + src_poff), block_size) ||
> +		    !IS_ALIGNED((unsigned long)(dest_addr + dest_poff), block_size) ||
> +		    cmp_len < block_size) {

Can this even happen?  Surely we can only dedup on a block boundary and
blocks are required to be a power of two and at least 512 bytes in size?

> +			if (memcmp(src_addr + src_poff, dest_addr + dest_poff,
> +				   cmp_len))
> +				same = false;
> +		} else {
> +			int i;
> +			size_t blocks = cmp_len / block_size;
> +			loff_t rem_len = cmp_len - (blocks * block_size);
> +			unsigned long *src = src_addr + src_poff;
> +			unsigned long *dst = dest_addr + src_poff;
> +
> +			for (i = 0; i < blocks; i++) {
> +				if (src[i] - dst[i]) {
> +					same = false;
> +					goto finished;
> +				}
> +			}
> +
> +			if (rem_len) {
> +				src_addr += src_poff + (blocks * block_size);
> +				dest_addr += dest_poff + (blocks * block_size);
> +				if (memcmp(src_addr, dest_addr, rem_len))
> +					same = false;
> +			}
> +		}
> +finished:
>  		kunmap_atomic(dest_addr);
>  		kunmap_atomic(src_addr);
>  unlock:
> -- 
> 2.25.1
> 
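[Editor's note: for readers following along, the word-at-a-time idea the patch is reaching for can be sketched in plain userspace C as below. This is a hedged illustration, not the kernel code under review: `words_differ` is a made-up helper name, and the point of the reply above is that a decent memcmp() implementation already does this internally on x86-64.]

```c
#include <stdint.h>
#include <string.h>

/*
 * Illustrative word-at-a-time comparison: if both buffers are aligned
 * to the native word size, compare whole unsigned longs first, then
 * fall back to memcmp() for any unaligned input or the remaining tail.
 * Returns nonzero if the buffers differ (a truth value, not memcmp()'s
 * signed ordering result).
 */
static int words_differ(const void *a, const void *b, size_t len)
{
	const unsigned char *pa = a, *pb = b;
	const size_t ws = sizeof(unsigned long);

	if (((uintptr_t)pa % ws) == 0 && ((uintptr_t)pb % ws) == 0) {
		while (len >= ws) {
			if (*(const unsigned long *)pa !=
			    *(const unsigned long *)pb)
				return 1;
			pa += ws;
			pb += ws;
			len -= ws;
		}
	}
	/* Unaligned buffers, or the sub-word tail: byte-by-byte. */
	return memcmp(pa, pb, len) != 0;
}
```

Note the sketch tests word inequality with `!=` and handles the tail by advancing the same pointers it iterated with, which sidesteps two of the issues in the quoted patch (the `src[i] - dst[i]` test and the `dest_addr + src_poff` copy-paste slip).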