For performance it compares SHA1 hashes rather than the file contents themselves. To avoid unnecessary full-file reads, it first compares hashes of just the first page (4 KiB) of each file; only if those match does it compute and compare the full-file hashes. Each file is read at most once and sequentially, so a file that occupies a single extent can be read in one large contiguous transfer. That is noticeably faster than a direct byte-for-byte compare, which seeks back and forth between two files that may sit at opposite ends of the disk.
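Stripped down, the comparison stage looks roughly like the sketch below (helper names are just illustrative, the size check and digest caching are omitted, and SHA1 comes from OpenSSL's libcrypto, so link with -lcrypto):

/* Two-stage compare: hash the first 4 KiB of each file, and only fall
 * back to a full-file SHA1 when the prefix hashes match. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <openssl/sha.h>

#define PREFIX_LEN 4096

/* SHA1 of the first 4 KiB (or the whole file, if it is shorter). */
static int hash_prefix(const char *path, unsigned char md[SHA_DIGEST_LENGTH])
{
    unsigned char buf[PREFIX_LEN];
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, sizeof(buf));
    close(fd);
    if (n < 0)
        return -1;
    SHA1(buf, (size_t)n, md);
    return 0;
}

/* SHA1 of the whole file, read sequentially in large chunks so a
 * single-extent file becomes one long contiguous transfer. */
static int hash_full(const char *path, unsigned char md[SHA_DIGEST_LENGTH])
{
    unsigned char buf[1 << 16];
    SHA_CTX ctx;
    ssize_t n;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    SHA1_Init(&ctx);
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        SHA1_Update(&ctx, buf, (size_t)n);
    close(fd);
    if (n < 0)
        return -1;
    SHA1_Final(md, &ctx);
    return 0;
}

/* 1 if the files (almost certainly) match, 0 if not, -1 on error. */
int same_contents(const char *a, const char *b)
{
    unsigned char ha[SHA_DIGEST_LENGTH], hb[SHA_DIGEST_LENGTH];

    if (hash_prefix(a, ha) < 0 || hash_prefix(b, hb) < 0)
        return -1;
    if (memcmp(ha, hb, sizeof(ha)) != 0)
        return 0;               /* prefixes differ: no full read needed */

    if (hash_full(a, ha) < 0 || hash_full(b, hb) < 0)
        return -1;
    return memcmp(ha, hb, sizeof(ha)) == 0;
}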
I am looking for additional performance enhancements, and I don't mind using filesystem-specific features. For example, I am now stashing the file hashes in xfs extended attributes.
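The xattr caching is nothing more elaborate than this (the attribute name "user.sha1" is just what I picked, and a real cache also needs to record mtime/size so stale hashes get invalidated):

#include <sys/xattr.h>
#include <openssl/sha.h>

#define XATTR_NAME "user.sha1"

/* Store the 20-byte digest on the file itself. */
int store_hash(const char *path, const unsigned char md[SHA_DIGEST_LENGTH])
{
    return setxattr(path, XATTR_NAME, md, SHA_DIGEST_LENGTH, 0);
}

/* Fetch a previously stored digest; 0 on success, -1 if missing. */
int load_hash(const char *path, unsigned char md[SHA_DIGEST_LENGTH])
{
    ssize_t n = getxattr(path, XATTR_NAME, md, SHA_DIGEST_LENGTH);
    return n == SHA_DIGEST_LENGTH ? 0 : -1;
}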
I regularly run xfs_fsr and have added fallocate() calls to the major file copy utilities, so all of my files are in single extents. Is there an easy way to ask xfs where those extents are located so that I could sort a set of files by location and then access them in a more efficient order?
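Something along these lines is what I have in mind, assuming the generic FIEMAP ioctl is the right tool for this on xfs (I don't know whether there is something more direct). The sketch just returns the physical byte offset of a file's first extent, which should be enough to sort single-extent files by on-disk position before hashing:

#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

/* Physical byte offset of the file's first extent, or UINT64_MAX on
 * error or for an empty file. */
uint64_t first_extent_offset(const char *path)
{
    uint64_t off = UINT64_MAX;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return off;

    /* Room for the fiemap header plus a single extent record. */
    struct fiemap *fm = calloc(1, sizeof(*fm) + sizeof(struct fiemap_extent));
    if (!fm) {
        close(fd);
        return off;
    }

    fm->fm_start = 0;
    fm->fm_length = ~0ULL;            /* map the whole file              */
    fm->fm_flags = FIEMAP_FLAG_SYNC;  /* flush first so the map is real  */
    fm->fm_extent_count = 1;          /* only the first extent is needed */

    if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0 && fm->fm_mapped_extents > 0)
        off = fm->fm_extents[0].fe_physical;

    free(fm);
    close(fd);
    return off;
}

The idea would be to qsort() the candidate list on that value so the full-file reads proceed roughly in disk order.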
I know that there's more to reading a file than accessing its data extents, but by the time I'm comparing files I have already lstat()'ed them all, so their inodes and directory paths are probably still in the cache.
Thanks,
Phil