Hi,

I'm investigating XFS block-level deduplication via reflink
(FIDEDUPERANGE), and I'm trying to track down some performance problems
I'm seeing. I have a fresh filesystem of about 4–8 TB (made with
mkfs.xfs 6.1.0) that I copied data onto a few days ago, and I'm running
6.13.0-rc4 (since that was the most recent kernel when I last had the
chance to boot; I believe I've seen this before with older kernels, so
I don't think this is a regression). The underlying block device is an
LVM volume on top of a RAID-6, and when I read sequentially from large
files, it gives me roughly 1.1 GB/sec (although not completely evenly).

My deduplication code works in mostly the obvious way: it first reads
the files and hashes blocks from them, then figures out (through some
algorithms that are not important here) which file ranges should be
deduplicated. It is the latter part that is slow; almost so slow as to
be unusable. For instance, I have 13 files of about 10 GB each that
happen to be identical save for the first 20 kB. My program has
identified this, and calls ioctl(FIDEDUPERANGE) with one of the files
as source and the other 12 as destinations, in consecutive 16 MB chunks
(since that's what ioctl_fideduperange(2) recommends; I also tried
simply a single 10 GB call earlier, but it was no faster and also
stopped after the first gigabyte). strace gives:

ioctl(637, BTRFS_IOC_FILE_EXTENT_SAME or FIDEDUPERANGE,
      {src_offset=4294971392, src_length=16777216, dest_count=12,
       info=[{dest_fd=638, dest_offset=4294971392},
             {dest_fd=639, dest_offset=4294971392},
             {dest_fd=640, dest_offset=4294971392},
             {dest_fd=641, dest_offset=4294971392},
             {dest_fd=642, dest_offset=4294971392},
             {dest_fd=643, dest_offset=4294971392},
             {dest_fd=644, dest_offset=4294971392},
             {dest_fd=645, dest_offset=4294971392},
             {dest_fd=646, dest_offset=4294971392},
             {dest_fd=647, dest_offset=4294971392},
             {dest_fd=648, dest_offset=4294971392},
             {dest_fd=649, dest_offset=4294971392}]}

This ioctl call successfully deduplicated the data, but it took 71.52
_seconds_. At this rate, deduplicating the entire data set will take on
the order of days.

I don't understand why this would take so much time. I understand that
the kernel needs to read the ranges to verify that they are indeed
identical (this is the only sane API design!), but that comes out to
something like 2800 kB/sec (13 files × 16 MB in 71.5 seconds) from an
array that can deliver almost 400 times that. There is no other
activity on the filesystem in question, so it should not be contending
with anything else (locks, etc.), and the process does not appear to
use significant amounts of CPU time. iostat shows read activity varying
from maybe 300 kB/sec to 12000 kB/sec or so, and /proc/<pid>/stack
says:

[<0>] folio_wait_bit_common+0x174/0x220
[<0>] filemap_read_folio+0x64/0x8b
[<0>] do_read_cache_folio+0x119/0x164
[<0>] __generic_remap_file_range_prep+0x372/0x568
[<0>] generic_remap_file_range_prep+0x7/0xd
[<0>] xfs_reflink_remap_prep+0xb7/0x223 [xfs]
[<0>] xfs_file_remap_range+0x94/0x248 [xfs]
[<0>] vfs_dedupe_file_range_one+0x145/0x181
[<0>] vfs_dedupe_file_range+0x14d/0x1ca
[<0>] do_vfs_ioctl+0x483/0x8a4
[<0>] __do_sys_ioctl+0x51/0x83
[<0>] do_syscall_64+0x76/0xd8
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

Is there anything I can do to speed this up? Or is there simply some
sort of bug that causes it to be this slow?

/* Steinar */

-- 
Homepage: https://www.sesse.net/
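PS: In case a self-contained reproduction is useful, here is a minimal
sketch that does roughly what my dedupe pass does: FIDEDUPERANGE in
16 MiB chunks, one source file against multiple destinations. Error
handling is mostly elided, and the real program of course derives the
ranges from the hashing step described above rather than deduplicating
whole files.

/*
 * Sketch: dedupe all of SRC against one or more DEST files, in 16 MiB
 * FIDEDUPERANGE chunks. Usage: ./dedupe SRC DEST [DEST...]
 */
#include <fcntl.h>
#include <linux/fs.h>   /* FIDEDUPERANGE, struct file_dedupe_range */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

#define CHUNK (16ULL << 20)  /* 16 MiB per call, per ioctl_fideduperange(2) */

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s SRC DEST [DEST...]\n", argv[0]);
        return 1;
    }
    int num_dest = argc - 2;

    int src_fd = open(argv[1], O_RDONLY);
    if (src_fd == -1) { perror(argv[1]); return 1; }

    struct stat st;
    if (fstat(src_fd, &st) == -1) { perror("fstat"); return 1; }

    /* struct file_dedupe_range ends in a flexible array member, so
     * allocate room for one file_dedupe_range_info per destination. */
    struct file_dedupe_range *arg =
        calloc(1, sizeof(*arg) +
                  num_dest * sizeof(struct file_dedupe_range_info));
    if (arg == NULL) { perror("calloc"); return 1; }
    arg->dest_count = num_dest;
    for (int i = 0; i < num_dest; i++) {
        /* O_RDONLY suffices for dedupe destinations as long as we own
         * the files; otherwise, open them O_RDWR. */
        int fd = open(argv[i + 2], O_RDONLY);
        if (fd == -1) { perror(argv[i + 2]); return 1; }
        arg->info[i].dest_fd = fd;
    }

    for (off_t offset = 0; offset < st.st_size; offset += CHUNK) {
        uint64_t len = (uint64_t)(st.st_size - offset);
        if (len > CHUNK)
            len = CHUNK;
        arg->src_offset = (uint64_t)offset;
        arg->src_length = len;
        for (int i = 0; i < num_dest; i++)
            arg->info[i].dest_offset = (uint64_t)offset;

        if (ioctl(src_fd, FIDEDUPERANGE, arg) == -1) {
            perror("FIDEDUPERANGE");
            return 1;
        }
        /* Per-destination status: a negative errno, or one of
         * FILE_DEDUPE_RANGE_{SAME,DIFFERS}. */
        for (int i = 0; i < num_dest; i++) {
            if (arg->info[i].status < 0)
                fprintf(stderr, "%s @%jd: %s\n", argv[i + 2],
                        (intmax_t)offset,
                        strerror(-arg->info[i].status));
            else if (arg->info[i].status == FILE_DEDUPE_RANGE_DIFFERS)
                fprintf(stderr, "%s @%jd: differs, not deduped\n",
                        argv[i + 2], (intmax_t)offset);
        }
    }
    return 0;
}

(Builds with just cc -O2 dedupe.c -o dedupe; it only needs the kernel
UAPI headers.)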