On Mon, Jan 1, 2018 at 6:49 PM, Andreas Dilger <adilger@xxxxxxxxx> wrote:
> On Jan 1, 2018, at 6:59 PM, Theodore Ts'o <tytso@xxxxxxx> wrote:
>>
>> On Tue, Jan 02, 2018 at 01:21:32AM +0100, Niklas Hambüchen wrote:
>>> Hello filesystem hackers,
>>>
>>> Over in coreutils issue https://bugs.gnu.org/29921 we have found that
>>> unlink()ing `n` many files seems to take in general O(n²) wall time.
>
> Looking at the performance results in these tests (in particular the
> first results that showed system time separate from user time), the
> system time appears to be roughly linear with the number of files that
> are deleted.  It is the user time that is increasing quadratically, so
> that isn't the fault of the filesystem/kernel but rather "rm" itself
> as far as I can see.

Thank you both for the quick replies.
Here are those numbers again:

   nfiles     real   user     sys
   100000    0.51s  0.07s   0.43s
   200000    2.46s  0.15s   0.89s
   400000   10.78s  0.26s   2.21s
   800000   44.72s  0.58s   6.03s
  1600000  180.37s  1.06s  10.70s

Since the user and sys durations for the 1.6M case are so low (nowhere
near 2-3x those of the 800K case), I wonder if that test hit the inode
limit for that file system.  Niklas, what does "df -i" report for the
file system on which you ran that test?

Here are some numbers from a system with 32GB of RAM, an ext4 file
system on an SSD, and 8.8M free inodes to start with.  They show user
and system time growing very nearly linearly, with a small uptick on
the last iteration.  "real", however, appears to be growing faster
than linearly, even here.

  $ i=100000; while :; do mkdir x; (cd x && seq $i|xargs touch); \
      env time -f "%e %U %S $i" env rm -rf x; \
      case $i in 32*) break;; esac; i=$[i*2]; done

    real  user    sys        N
    0.48  0.05   0.41   100000
    1.28  0.10   0.86   200000
    3.79  0.21   1.83   400000
   10.51  0.43   4.04   800000
   28.00  0.88   8.97  1600000
   62.54  1.86  19.30  3200000
   ...

> At one time we discussed to change inode number allocation to be
> piecewise linear for inodes within the same directory.  As a directory
> grows larger, it could grab a chunk of free inodes in the inode table,
> then map the new filename hash to the range of free inodes, and use
> that when selecting the new inode number.  As that chunk is filled up,
> a new, larger chunk of free inodes would be selected and new filenames
> would be mapped into the new chunk.
>
> If this was done, then the hash-order directory traversal for deletion
> or stat would have approximately O(log(n)) inode table blocks to load
> for a given hash range rather than having to get each inode from a
> random inode table block.

Ouch.  It sounds like this is a known O(n) process and the change above
would have reduced it to O(log(n)).  Can you provide a reference?  I am
curious to know how it played out.

Our goal (with fts and coreutils) has been to make it harder for an
accident or maliciousness (a few million entries in a directory) to
hinder file system traversals.  Of course, it's not just rm: any
FS-traversal tool is affected: cp, chmod, chgrp, du, find, tar, etc.
Sure, quotas can help, but even self-inflicted accidents happen on
single-user systems with no quotas.

I idly wondered whether the default inode limits could save ext4 users.
Perhaps not.  This 850GB file system has 48M free inodes (caveat: I may
have changed the default when I created it -- I don't recall):

  Filesystem  Type  Inodes  IUsed  IFree  IUse%  Mounted on
  /dev/sda3   ext4     54M   6.0M    48M    12%  /x

Creating even half that many files in a single directory is quick and
easy, yet would cause disproportionate pain.  And that was a small
partition.
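
P.S.  For anyone who wants to experiment with the inode-ordering idea
from the shell, here is a rough sketch (only an approximation for
testing, not what rm or fts actually does internally): unlink the test
directory's entries in inode-number order rather than readdir order,
so that consecutive unlinks tend to touch nearby inode-table blocks.
It assumes the directory is the "x" created by the loop above, whose
seq-generated names contain no whitespace.

  # List "inode name" pairs without sorting by name (-U -i), sort
  # numerically by inode number, keep only the name, and unlink in
  # that order to improve inode-table locality.
  (cd x && ls -U -i | sort -n | awk '{print $2}' | xargs rm -f)

Comparing its wall time against a plain "rm -rf x" (both with a cold
cache) should give a hint of how much of the slowdown comes from
inode-table access order.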