Currently vfs_dedupe_file_range_compare compares data with a plain memcmp,
which effectively means the code does a byte-by-byte comparison. Instead, the
code could do a word-sized comparison without adversely affecting performance,
provided that the comparison length is at least as large as the native word
size and that the resulting memory addresses are properly aligned.

On a workload consisting of running duperemove (a userspace program that
deduplicates duplicated extents) on a fully duplicated dataset of 80g spread
among 20k 4m files, I get the following results:

            Unpatched:      Patched:
    real    21m45.275s      21m14.445s
    user    0m0.986s        0m0.933s
    sys     1m30.734s       1m8.900s (-25%)

Notable changes in the perf profiles:

    .... omitted for brevity ....
     0.29%     +1.01%  [kernel.vmlinux]  [k] vfs_dedupe_file_range_compare.constprop.0
    23.62%             [kernel.vmlinux]  [k] memcmp
    .... omitted for brevity ....

The memcmp is eliminated altogether and replaced by the newly introduced loop
in vfs_dedupe_file_range_compare, hence the 1% increase in cycles spent there.

Signed-off-by: Nikolay Borisov <nborisov@xxxxxxxx>
---
 fs/remap_range.c | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/fs/remap_range.c b/fs/remap_range.c
index e4a5fdd7ad7b..041e03b082ed 100644
--- a/fs/remap_range.c
+++ b/fs/remap_range.c
@@ -212,6 +212,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 	loff_t cmp_len;
 	bool same;
 	int error;
+	const uint8_t block_size = sizeof(unsigned long);
 
 	error = -EINVAL;
 	same = true;
@@ -256,9 +257,35 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 		flush_dcache_page(src_page);
 		flush_dcache_page(dest_page);
 
-		if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
-			same = false;
+		if (!IS_ALIGNED((unsigned long)(src_addr + src_poff), block_size) ||
+		    !IS_ALIGNED((unsigned long)(dest_addr + dest_poff), block_size) ||
+		    cmp_len < block_size) {
+			if (memcmp(src_addr + src_poff, dest_addr + dest_poff,
+				   cmp_len))
+				same = false;
+		} else {
+			int i;
+			size_t blocks = cmp_len / block_size;
+			loff_t rem_len = cmp_len - (blocks * block_size);
+			unsigned long *src = src_addr + src_poff;
+			unsigned long *dst = dest_addr + dest_poff;
+
+			for (i = 0; i < blocks; i++) {
+				if (src[i] - dst[i]) {
+					same = false;
+					goto finished;
+				}
+			}
+
+			if (rem_len) {
+				src_addr += src_poff + (blocks * block_size);
+				dest_addr += dest_poff + (blocks * block_size);
+				if (memcmp(src_addr, dest_addr, rem_len))
+					same = false;
+			}
+		}
 
+finished:
 		kunmap_atomic(dest_addr);
 		kunmap_atomic(src_addr);
 unlock:
-- 
2.25.1
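
For anyone who wants to try the idea outside the kernel, here is a minimal userspace sketch of the same approach: check size and alignment, compare word by word, then let memcmp handle any tail. The helper name ranges_equal is hypothetical and is not part of the patch. It loads each word with memcpy rather than by casting pointers, which is an assumption on my side to stay clear of strict-aliasing problems in userspace and is not how the patch does it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical userspace helper (not kernel code): returns true when the
 * first len bytes of a and b are identical.
 */
static bool ranges_equal(const void *a, const void *b, size_t len)
{
	const size_t word = sizeof(unsigned long);
	const unsigned char *pa = a;
	const unsigned char *pb = b;
	size_t blocks, i;

	/* Short or misaligned ranges fall back to a plain memcmp. */
	if (len < word || (uintptr_t)pa % word || (uintptr_t)pb % word)
		return memcmp(pa, pb, len) == 0;

	/* Compare one native word per iteration, stopping at the first mismatch. */
	blocks = len / word;
	for (i = 0; i < blocks; i++) {
		unsigned long wa, wb;

		memcpy(&wa, pa + i * word, word);
		memcpy(&wb, pb + i * word, word);
		if (wa != wb)
			return false;
	}

	/* Compare the tail bytes that do not fill a whole word (may be zero). */
	return memcmp(pa + blocks * word, pb + blocks * word,
		      len - blocks * word) == 0;
}
```

Calling ranges_equal(x, y, sizeof(x)) on two word-aligned arrays takes the word loop, while passing a pointer offset by one byte takes the memcmp fallback. Whether this beats the platform's memcmp depends on the libc and architecture, so it should be measured rather than assumed.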