On Wed, Feb 29, 2012 at 11:44:31PM -0500, Theodore Tso wrote:
> You might try sorting the entries returned by readdir by inode number
> before you stat them. This is a long-standing weakness in ext3/ext4,
> and it has to do with how we added hashed tree indexes to directories
> in (a) a backwards compatible way, that (b) was POSIX compliant with
> respect to adding and removing directory entries concurrently with
> reading all of the directory entries using readdir.
>
> You might try compiling spd_readdir from the e2fsprogs source tree (in
> the contrib directory):
>
> http://git.kernel.org/?p=fs/ext2/e2fsprogs.git;a=blob;f=contrib/spd_readdir.c;h=f89832cd7146a6f5313162255f057c5a754a4b84;hb=d9a5d37535794842358e1cfe4faa4a89804ed209
>
> … and then using that as a LD_PRELOAD, and see how that changes things.
>
> The short version is that we can't easily do this in the kernel since
> it's a problem that primarily shows up with very big directories, and
> using non-swappable kernel memory to store all of the directory
> entries and then sort them so they can be returned in inode number
> just isn't practical. It is something which can be easily done in
> userspace, though, and a number of programs (including mutt for its
> Maildir support) does do, and it helps greatly for workloads where you
> are calling readdir() followed by something that needs to access the
> inode (i.e., stat, unlink, etc.)
>

For reading the files, the acp program I sent him tries to do something
similar. I had forgotten about spd_readdir though, we should consider
hacking that into cp and tar.

One interesting note is the page cache used to help here. Picture two
tests:

A) time tar cf /dev/zero /home

and

cp -a /home /new_dir_in_new_fs
unmount / flush caches

B) time tar cf /dev/zero /new_dir_in_new_fs

On ext, the time for B used to be much faster than the time for A
because the files would get written back to disk in roughly htree order.
Based on Jacek's data, that isn't true anymore.

-chris
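
For illustration, the userspace approach described in the quoted text (read every directory entry first, sort by inode number, and only then touch the inodes) could be sketched roughly as below. This is a minimal sketch in Python, not the spd_readdir implementation. The function names entries_by_inode and stat_in_inode_order are hypothetical. It assumes a Linux-like system where the inode number is available from the readdir data without an extra stat call.

```python
import os


def entries_by_inode(directory):
    """Return the names in `directory`, sorted by inode number.

    Hypothetical helper: the whole directory is read into userspace
    memory first, which is the step the kernel cannot afford to do.
    """
    with os.scandir(directory) as it:
        # On Unix, DirEntry.inode() is taken from the directory entry
        # itself, so collecting it does not require a stat() per file.
        pairs = [(entry.inode(), entry.name) for entry in it]
    pairs.sort()
    return [name for _inode, name in pairs]


def stat_in_inode_order(directory):
    """lstat() every entry of `directory` in ascending inode order."""
    results = []
    for name in entries_by_inode(directory):
        path = os.path.join(directory, name)
        try:
            results.append((name, os.lstat(path)))
        except FileNotFoundError:
            # The entry was removed after the directory was read.
            continue
    return results
```

The same ordering would apply to any per-inode operation that follows the directory read, such as unlink or open, not only stat. Whether it helps on a given filesystem is something to measure rather than assume.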