On Mon, Jan 01, 2018 at 07:49:55PM -0700, Andreas Dilger wrote:
> At one time we discussed changing inode number allocation to be
> piecewise linear for inodes within the same directory.  As a directory
> grows larger, it could grab a chunk of free inodes in the inode table,
> then map the new filename hash to the range of free inodes, and use
> that when selecting the new inode number.  As that chunk is filled up,
> a new, larger chunk of free inodes would be selected and new filenames
> would be mapped into the new chunk.

Well, it's not so simple.  Remember that there are only 16 inodes per
4k inode table block, and only 32,768 inodes per block group.  In the
workloads discussed in the coreutils bug, one million to 32 million
files are being created in a single directory.  At 16 inodes per
block, 32 million inodes span two million 4k inode table blocks, so
at that scale, unless we start doing truly ridiculous readaheads in
the inode table, the disk is going to be seeking randomly no matter
what you do.

Also, if we try to stay strictly within the inodes of one block
group, then, assuming the average file size is larger than 4k
(generally a safe bet), the data blocks will tend to end up in other
block groups, far away from their inodes, which increases latency
when reading the files back.  And most of the time, optimizing for
reading files makes sense (think /usr/include), because that happens
much more often than "rm -rf" workloads.

This gets especially tricky if the directory is dynamically growing
and shrinking: you might be deleting from the first chunk while
allocating from the second chunk, or maybe the third.  The bottom
line is that it's relatively easy to optimize for specific workloads,
but even if this is a good idea, "rm -rf" of zero-length files is not
the first workload I would be hyper-optimizing for.

Cheers,

						- Ted
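
P.S.  To make the above concrete, here is a toy, userspace-only
sketch of the chunk scheme being proposed.  To be clear: none of
these names exist in ext4, the chunk sizes and hash are made up, and
it ignores the on-disk bitmaps, journaling, and locking entirely; it
only models the hash-into-a-growing-chunk mapping.

/*
 * Illustrative sketch only: a directory reserves a run of consecutive
 * free inode numbers, maps each filename hash into that run, and
 * grabs a larger run once the current one fills up.
 */
#include <stdint.h>
#include <stdio.h>

struct ino_chunk {
	uint32_t start;		/* first inode number in the chunk */
	uint32_t count;		/* chunk size; doubles as the dir grows */
	uint32_t used;		/* inodes handed out from this chunk */
	uint64_t map;		/* 1 bit per inode; caps count at 64 here */
};

/* Hypothetical stand-in for searching the inode bitmaps for a free run. */
static uint32_t grab_free_run(uint32_t count)
{
	static uint32_t next = 11;	/* first non-reserved inode */
	uint32_t start = next;

	next += count;
	return start;
}

/* Map a filename hash into the current chunk; linear-probe collisions. */
static uint32_t alloc_ino(struct ino_chunk *c, uint32_t name_hash)
{
	uint32_t i;

	if (c->used == c->count) {	/* chunk full: grab a bigger one */
		c->count = c->count ? c->count * 2 : 16;
		if (c->count > 64)
			c->count = 64;	/* bitmap limit in this toy */
		c->start = grab_free_run(c->count);
		c->used = 0;
		c->map = 0;
	}
	for (i = 0; i < c->count; i++) {
		uint32_t s = (name_hash + i) % c->count;

		if (!(c->map & (1ULL << s))) {
			c->map |= 1ULL << s;
			c->used++;
			return c->start + s;
		}
	}
	return 0;	/* unreachable: the chunk had a free slot */
}

int main(void)
{
	struct ino_chunk c = { 0, 0, 0, 0 };
	uint32_t h;

	for (h = 0; h < 40; h++)
		printf("hash %2u -> inode %u\n", h,
		       alloc_ino(&c, h * 2654435761u));
	return 0;
}

Note that grab_free_run() hands back adjacent runs only because
nothing else is allocating; on a real filesystem the successive
chunks would end up scattered across the inode table, which is where
the random seeking comes back in.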