On Mon, Jan 1, 2018 at 6:49 PM, Andreas Dilger <adilger@xxxxxxxxx> wrote:
> On Jan 1, 2018, at 6:59 PM, Theodore Ts'o <tytso@xxxxxxx> wrote:
>>
>> On Tue, Jan 02, 2018 at 01:21:32AM +0100, Niklas Hambüchen wrote:
>>> Hello filesystem hackers,
>>>
>>> Over in coreutils issue https://bugs.gnu.org/29921 we have found that
>>> unlink()ing `n` many files seems to take in general O(n²) wall time.
>
> Looking at the performance results in these tests (in particular the
> first results that showed system time separate from user time), the
> system time appears to be roughly linear with the number of files that
> are deleted.  It is the user time that is increasing quadratically, so
> that isn't the fault of the filesystem/kernel but rather "rm" itself
> as far as I can see.

Thank you both for the quick replies.
Here are those numbers again:

   nfiles     real   user     sys
   100000    0.51s  0.07s   0.43s
   200000    2.46s  0.15s   0.89s
   400000   10.78s  0.26s   2.21s
   800000   44.72s  0.58s   6.03s
  1600000  180.37s  1.06s  10.70s

Since the user and sys durations for the 1.6M case are so low (nowhere
near 2-3x those of the 800K case), I wonder if that test hit the inode
limit for that file system.  Niklas, what does "df -i" report for the
file system on which you ran that test?

Here are some numbers from a system with 32GB of RAM, an ext4 file
system on an SSD, and 8.8M free inodes to start with.  They show user
and system time growing very nearly linearly, with a small uptick on
the last iteration.  "real", however, appears to be growing faster
than linearly, even here.

  $ i=100000; while :; do mkdir x; (cd x && seq $i|xargs touch); \
      env time -f "%e %U %S $i" env rm -rf x; \
      case $i in 32*) break;; esac; i=$[i*2]; done

    real  user    sys        N
    0.48  0.05   0.41   100000
    1.28  0.10   0.86   200000
    3.79  0.21   1.83   400000
   10.51  0.43   4.04   800000
   28.00  0.88   8.97  1600000
   62.54  1.86  19.30  3200000
   ...

> At one time we discussed to change inode number allocation to be
> piecewise linear for inodes within the same directory.  As a directory
> grows larger, it could grab a chunk of free inodes in the inode table,
> then map the new filename hash to the range of free inodes, and use
> that when selecting the new inode number.  As that chunk is filled up,
> a new, larger chunk of free inodes would be selected and new filenames
> would be mapped into the new chunk.
>
> If this was done, then the hash-order directory traversal for deletion
> or stat would have approximately O(log(n)) inode table blocks to load
> for a given hash range rather than having to get each inode from a
> random inode table block.

Ouch.  It sounds like this is a known O(n) process and the change above
would have reduced it to O(log(n)).  Can you provide a reference?  I am
curious to know how it played out.

Our goal (with fts and coreutils) has been to make it harder for an
accident or maliciousness (a few million entries in a directory) to
hinder file system traversals.  Of course, it's not just rm: any
FS-traversal tool is affected: cp, chmod, chgrp, du, find, tar, etc.
Sure, quotas can help, but even self-inflicted accidents happen on
single-user systems with no quotas.

I idly wondered whether the default inode limits could save ext4 users.
Perhaps not.  This 850GB file system has 48M free inodes (caveat: I may
have changed the default when I created it -- I don't recall):

  Filesystem  Type  Inodes  IUsed  IFree  IUse%  Mounted on
  /dev/sda3   ext4     54M   6.0M    48M    12%  /x

Creating even half that many files in a single directory is quick and
easy, yet would cause disproportionate pain.  And that was a small
partition.
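
P.S.  For anyone who wants to experiment with the inode-ordering idea
from the shell, here is a rough sketch (only an approximation for
testing, not what rm or fts actually does internally): unlink the test
directory's entries in inode-number order rather than readdir order,
so that consecutive unlinks tend to touch nearby inode-table blocks.
It assumes the directory is the "x" created by the loop above, whose
seq-generated names contain no whitespace.

  # List "inode name" pairs without sorting by name (-U -i), sort
  # numerically by inode number, keep only the name, and unlink in
  # that order to improve inode-table locality.
  (cd x && ls -U -i | sort -n | awk '{print $2}' | xargs rm -f)

Comparing its wall time against a plain "rm -rf x" (both with a cold
cache) should give a hint of how much of the slowdown comes from
inode-table access order.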