Re: O(n^2) deletion performance

On Jan 1, 2018, at 6:59 PM, Theodore Ts'o <tytso@xxxxxxx> wrote:
> 
> On Tue, Jan 02, 2018 at 01:21:32AM +0100, Niklas Hambüchen wrote:
>> Hello filesystem hackers,
>> 
>> Over in coreutils issue https://bugs.gnu.org/29921 we have found that
>> unlink()ing `n` many files seems to take in general O(n²) wall time.

Looking at the performance results in these tests (in particular the
first results that showed system time separate from user time), the
system time appears to be roughly linear with the number of files that
are deleted.  It is the user time that is increasing quadratically, so
that isn't the fault of the filesystem/kernel but rather "rm" itself
as far as I can see.
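One way to check that split is to measure user and system CPU time around the unlink loop itself, the way time(1) reports them for rm. A minimal sketch (illustrative only; not the original benchmark, and the file counts and naming are made up):

```python
import os
import resource
import tempfile

def unlink_many(n):
    """Create n empty files, then time the unlink loop, reporting
    user vs. system CPU time separately."""
    d = tempfile.mkdtemp()
    names = [os.path.join(d, "f%07d" % i) for i in range(n)]
    for name in names:
        open(name, "w").close()
    before = resource.getrusage(resource.RUSAGE_SELF)
    for name in names:
        os.unlink(name)
    after = resource.getrusage(resource.RUSAGE_SELF)
    os.rmdir(d)
    return (after.ru_utime - before.ru_utime,   # user CPU seconds
            after.ru_stime - before.ru_stime)   # system CPU seconds

# If system time grows roughly linearly with n while user time grows
# faster than linearly, the quadratic cost is in userspace, not the
# kernel.
for n in (1000, 2000, 4000):
    user, system = unlink_many(n)
    print(n, user, system)
```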

>> So far, others (in CC) and I have benchmarked `rm -r` and other methods
>> of unlink()ing multiple files and we've found approximately quadratic
>> performance across...
> 
> So a couple of observations.  You're benchmarking the creation and
> deletion of files that don't have any blocks associated with them, so
> you are measuring the directory and inode deallocation operations.
> There are no block (de-)allocation operations going on.
> 
> Secondly, you aren't dropping the cache between the creation phase
> and the timed deletion phase of the test.  This means that the
> results are going to be heavily impacted by the amount of memory
> available on the system.  This doesn't matter all that much, since you
> aren't actually writing any data blocks, so you're only measuring
> metadata operations, which are going to be journalled.
> 
> Sorting by inode number helps in the cold cache case, especially if
> you have data blocks that you need to delete.  I also suspect that the
> original tests of the sort-the-inodes commit were with substantially
> smaller numbers of files, especially if these were non-zero-sized
> files being deleted.
> 
> Sorting by inode number helps optimize deleting files in creation
> order for ext3/4, and that helps optimize updates to the inode table
> and to the allocation bitmaps.  However, the number of I/Os, and hence
> the elapsed time, is also going to be a function of the directory
> block updates.
> 

> In general you're not going to be able to delete files in optimal
> directory tree order *and* simultaneously in optimal inode table
> order.  So fundamentally there's not a whole lot that can be done
> here as directories get super-sized.

At one time we discussed changing inode number allocation to be
piecewise linear for inodes within the same directory.  As a directory
grows larger, it could grab a chunk of free inodes in the inode table,
then map the new filename hash to the range of free inodes, and use
that when selecting the new inode number.  As that chunk is filled up,
a new, larger chunk of free inodes would be selected and new filenames
would be mapped into the new chunk.

If this was done, then the hash-order directory traversal for deletion
or stat would have approximately O(log(n)) inode table blocks to load
for a given hash range rather than having to get each inode from a
random inode table block.
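A toy model of that allocator might look like the following. Everything here (class name, chunk sizes, the crc32 hash standing in for the directory hash) is made up for illustration; this is a sketch of the proposal, not ext4 code:

```python
import zlib

class ChunkedInodeAllocator:
    """Toy model: a directory hands out inode numbers from growing
    chunks of the inode table; a filename hash picks the slot inside
    the current chunk, keeping hash order and inode order correlated."""

    def __init__(self, first_chunk=16):
        self.chunks = []          # (base, size) of each chunk grabbed so far
        self.used = set()         # inode numbers already allocated
        self.next_base = 1000     # pretend start of a free inode range
        self._grab(first_chunk)

    def _grab(self, size):
        self.chunks.append((self.next_base, size))
        self.next_base += size

    def alloc(self, name):
        base, size = self.chunks[-1]
        slot = zlib.crc32(name.encode()) % size   # hash -> slot in chunk
        # linear-probe within the chunk for a free inode
        for i in range(size):
            ino = base + (slot + i) % size
            if ino not in self.used:
                self.used.add(ino)
                return ino
        # chunk full: grab a new, larger chunk and allocate there
        self._grab(size * 2)
        return self.alloc(name)
```

With chunk sizes doubling, a directory of n files spans only O(log(n)) chunks, so a hash-order traversal reads the inode-table blocks of each chunk roughly once instead of seeking to a random block per inode.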

> For ext4, you can improve things somewhat by using a larger journal
> size, if the workload is extremely metadata-heavy (which it is in this
> very artificial case of deleting a directory hierarchy with 16 million+
> zero-length inodes).  We now use a large default journal size with
> newer versions of e2fsprogs, but if your file system was created a
> while ago, it probably is maxed out with a 128 meg journal.  Using a
> one or two gig journal (e.g., "mke2fs -t ext4 -J size=2048 /dev/sdXX")
> will help in this case.  It's unlikely to help that much for more
> realistic workloads, though.
> 
>> Unless instructed otherwise, I would also go ahead and file bugs for the
>> respective filesystem types, as quadratic deletion time makes for a bad
>> user experience and lots of sysadmin tears.
> 
> Unfortunately there's not a whole lot I can do from the ext4 end.  And
> it's not clear to me that optimizing for this extremely artificial
> workload makes a lot of sense even if we could do something.  (I
> suspect we could create an rm -rf system call which would work so long
> as you had sufficient amounts of non-swappable kernel memory available
> --- but in the end it's going to be impossible to make something which
> is purely linear for arbitrarily large directories.)

Cheers, Andreas
