On Tue, Jan 02, 2018 at 01:21:32AM +0100, Niklas Hambüchen wrote:
> Hello filesystem hackers,
>
> Over in coreutils issue https://bugs.gnu.org/29921 we have found that
> unlink()ing `n` many files seems to take in general O(n²) wall time.
>
> So far me and others (in CC) have benchmarked `rm -r` and other methods
> of unlink()ing multiple files and we've found approximately quadratic
> performance across...

So, a couple of observations.  You're benchmarking the creation and
deletion of files that don't have any blocks associated with them, so
you are measuring the directory and inode deallocation operations.
There are no block (de-)allocation operations going on.

Secondly, you aren't dropping the cache between the creation phase and
the timed deletion phase of the test.  This means that the results are
going to be heavily impacted by the amount of memory available on the
system.  That doesn't matter all that much here, since you aren't
actually writing any data blocks, and so you're only measuring metadata
operations, which are going to be journalled.

Sorting by inode number helps in the cold cache case, especially if you
have data blocks that you need to delete.  I also suspect that the
original tests of the sort-the-inodes commit were done with
substantially smaller numbers of files, and especially with
non-zero-sized files being deleted.

Sorting by inode number optimizes deleting files in creation order for
ext3/4, and that in turn optimizes updates to the inode table and to the
allocation bitmaps.  However, the number of I/Os, and hence the elapsed
time, is also going to be a function of the directory block updates.  In
general you're not going to be able to delete files in optimal directory
tree order *and* simultaneously in optimal inode table order.  So
fundamentally there's not a whole lot that can be done here as
directories get super-sized.
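To make the sort-by-inode idea concrete, here is a rough sketch (not
from the original mail; it assumes GNU coreutils and approximates what
the sort-the-inodes optimization does) of unlinking a directory's files
in inode-number order rather than directory order:

```shell
# Create a scratch directory with 100 empty files, mirroring the
# zero-length-inode benchmark discussed above.
dir=$(mktemp -d)
for i in $(seq 1 100); do : > "$dir/f$i"; done

# For a genuinely cold-cache run you would first, as root:
#   echo 3 > /proc/sys/vm/drop_caches
# (omitted here since it needs root and affects the whole system).

# ls -U lists in raw directory order without sorting; -i prefixes each
# name with its inode number.  Sorting numerically by inode groups the
# unlink()s by inode-table locality before handing the names to rm.
ls -Ui "$dir" | sort -n | awk '{print $2}' \
    | (cd "$dir" && xargs rm --)

rmdir "$dir"
```

As noted above, this helps the inode table and bitmap updates but not
the directory block updates, so it cannot make the whole operation
linear.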
For ext4, you can improve things somewhat by using a larger journal
size, if the workload is extremely metadata-heavy (which it is in this
very artificial case of deleting a directory hierarchy with 16 million+
zero-length inodes).  We now use a larger default journal size with
newer versions of e2fsprogs, but if your file system was created a
while ago, it is probably maxed out at a 128 meg journal.  Using a one
or two gig journal (e.g., "mke2fs -t ext4 -J size=2048 /dev/sdXX") will
help in this case.  It's unlikely to help that much for more realistic
workloads, though.

> Unless instructed otherwise, I would also go ahead and file bugs for the
> respective filesystem types, as quadratic deletion time makes for a bad
> user experience and lots of sysadmin tears.

Unfortunately there's not a whole lot I can do from the ext4 end.  And
it's not clear to me that optimizing for this extremely artificial
workload makes a lot of sense even if we could do something.  (I
suppose we could create an "rm -rf" system call, which would work so
long as you had sufficient amounts of non-swappable kernel memory
available --- but in the end it's going to be impossible to make
something which is purely linear for arbitrarily large directories.)

					- Ted

P.S.  I suspect that if you had sufficiently large amounts of
super-expensive NVMe storage, you'd see super-linear times even for XFS
on NVMe....  The fast storage simply puts off the inevitable.
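[Coming back to the journal-size suggestion above: a small sketch, not
part of the original mail, of creating a filesystem with a 2 gig
journal and checking it.  Since mke2fs is destructive, it is run here
against a sparse loopback image file rather than a real /dev/sdXX.]

```shell
# Sparse 8 GiB backing file standing in for the block device.
img=$(mktemp)
truncate -s 8G "$img"

# -J size=2048 requests a 2048 MB journal; -F is needed because the
# target is a regular file, not a block device.
mke2fs -F -q -t ext4 -J size=2048 "$img"

# dumpe2fs (part of e2fsprogs) shows the superblock's journal fields.
dumpe2fs -h "$img" 2>/dev/null | grep -i journal

rm -f "$img"
```

An existing filesystem can be given a bigger journal the same way only
by removing and re-adding it (tune2fs -O ^has_journal, then
tune2fs -J size=2048), which should be done with the filesystem
unmounted.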