For performance it compares SHA1 hashes rather than the file contents themselves. To avoid unnecessary full-file reads, it first compares hashes of just the first page (4 KiB) of each file; only if those match does it compute and compare the full-file hashes. Each file is read at most once and sequentially, so a file that occupies a single extent can be read in one large contiguous transfer. That is noticeably faster than a direct byte-for-byte compare, which seeks back and forth between two files that may sit at opposite ends of the disk.
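Stripped down, the comparison stage looks roughly like the sketch below (helper names are just illustrative, the size check and digest caching are omitted, and SHA1 comes from OpenSSL's libcrypto, so link with -lcrypto):

/* Two-stage compare: hash the first 4 KiB of each file, and only fall
 * back to a full-file SHA1 when the prefix hashes match. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <openssl/sha.h>

#define PREFIX_LEN 4096

/* SHA1 of the first 4 KiB (or the whole file, if it is shorter). */
static int hash_prefix(const char *path, unsigned char md[SHA_DIGEST_LENGTH])
{
    unsigned char buf[PREFIX_LEN];
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, sizeof(buf));
    close(fd);
    if (n < 0)
        return -1;
    SHA1(buf, (size_t)n, md);
    return 0;
}

/* SHA1 of the whole file, read sequentially in large chunks so a
 * single-extent file becomes one long contiguous transfer. */
static int hash_full(const char *path, unsigned char md[SHA_DIGEST_LENGTH])
{
    unsigned char buf[1 << 16];
    SHA_CTX ctx;
    ssize_t n;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    SHA1_Init(&ctx);
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        SHA1_Update(&ctx, buf, (size_t)n);
    close(fd);
    if (n < 0)
        return -1;
    SHA1_Final(md, &ctx);
    return 0;
}

/* 1 if the files (almost certainly) match, 0 if not, -1 on error. */
int same_contents(const char *a, const char *b)
{
    unsigned char ha[SHA_DIGEST_LENGTH], hb[SHA_DIGEST_LENGTH];

    if (hash_prefix(a, ha) < 0 || hash_prefix(b, hb) < 0)
        return -1;
    if (memcmp(ha, hb, sizeof(ha)) != 0)
        return 0;               /* prefixes differ: no full read needed */

    if (hash_full(a, ha) < 0 || hash_full(b, hb) < 0)
        return -1;
    return memcmp(ha, hb, sizeof(ha)) == 0;
}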
I am looking for additional performance enhancements, and I don't mind using filesystem-specific features. For example, I am now stashing the file hashes in xfs extended attributes.
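The xattr caching is nothing more elaborate than this (the attribute name "user.sha1" is just what I picked, and a real cache also needs to record mtime/size so stale hashes get invalidated):

#include <sys/xattr.h>
#include <openssl/sha.h>

#define XATTR_NAME "user.sha1"

/* Store the 20-byte digest on the file itself. */
int store_hash(const char *path, const unsigned char md[SHA_DIGEST_LENGTH])
{
    return setxattr(path, XATTR_NAME, md, SHA_DIGEST_LENGTH, 0);
}

/* Fetch a previously stored digest; 0 on success, -1 if missing. */
int load_hash(const char *path, unsigned char md[SHA_DIGEST_LENGTH])
{
    ssize_t n = getxattr(path, XATTR_NAME, md, SHA_DIGEST_LENGTH);
    return n == SHA_DIGEST_LENGTH ? 0 : -1;
}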
I regularly run xfs_fsr and have added fallocate() calls to the major file copy utilities, so all of my files are in single extents. Is there an easy way to ask xfs where those extents are located so that I could sort a set of files by location and then access them in a more efficient order?
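Something along these lines is what I have in mind, assuming the generic FIEMAP ioctl is the right tool for this on xfs (I don't know whether there is something more direct). The sketch just returns the physical byte offset of a file's first extent, which should be enough to sort single-extent files by on-disk position before hashing:

#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

/* Physical byte offset of the file's first extent, or UINT64_MAX on
 * error or for an empty file. */
uint64_t first_extent_offset(const char *path)
{
    uint64_t off = UINT64_MAX;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return off;

    /* Room for the fiemap header plus a single extent record. */
    struct fiemap *fm = calloc(1, sizeof(*fm) + sizeof(struct fiemap_extent));
    if (!fm) {
        close(fd);
        return off;
    }

    fm->fm_start = 0;
    fm->fm_length = ~0ULL;            /* map the whole file              */
    fm->fm_flags = FIEMAP_FLAG_SYNC;  /* flush first so the map is real  */
    fm->fm_extent_count = 1;          /* only the first extent is needed */

    if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0 && fm->fm_mapped_extents > 0)
        off = fm->fm_extents[0].fe_physical;

    free(fm);
    close(fd);
    return off;
}

The idea would be to qsort() the candidate list on that value so the full-file reads proceed roughly in disk order.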
I know that there's more to reading a file than accessing its data extents, but by the time I'm comparing files I have already lstat()'ed them all, so their inodes and directory paths are probably still in the cache.
Thanks,
Phil