On Wed, Jan 03, 2018 at 08:16:58PM -0800, Jim Meyering wrote:
> On Mon, Jan 1, 2018 at 10:22 PM, Theodore Ts'o <tytso@xxxxxxx> wrote:
> > versus the patience needed to recover from
> > accidentally dumping 16 million files into a directory --- I prefer
> > the latter. I can wait a few minutes....
>
> I've just run a test on the spinning-disk file system mentioned above,
> and it took 75 minutes to delete 12.8M entries. That's rather nasty.

An unrealistic performance expectation is not a code bug. It's a
knowledge problem and a learning opportunity. :)

So let's put some numbers to this operation. If this is a cache-cold
rm, then the number of read IOs required on a perfectly laid out /XFS/
filesystem would be roughly:

	read IOs = num getdents IOs + num inode lookup IOs
	         = 12.8M / 32 + 12.8M / 500
	         = 400,000 + 25,600
	         = 425,600

Now, let's assume a low seek time for each IO on a SATA drive - about
3ms per IO. That means the read IO delay is:

	read IO delay = 425,600 * 0.003s = 1277s = 21m17s

Hence with a perfect layout we've got at least 21 minutes of blocking
IO time alone in this workload. If the directory and inodes are
scattered, and we get an average seek time (7ms), then we're looking
at 50 minutes of blocking read IO time in the workload.

Then we've got to add the time for each operation to be executed on
the CPU. If I take the numbers from a 10M file unlink I did yesterday
in a VM with device image files hosted on XFS on NVMe drives (times
are wall, user and system seconds):

	File count: 10000000
	Create:      918.46    47.14    866.24
	Unlink:     1545.99    12.14   1410.97

There's 1410s of system CPU time to unlink 10 million inodes. Add
another 25% for the larger inode count in your test, and we've got
almost 30 minutes of CPU time in this workload. If we add that to our
50 minutes of IO blocking time, we've got ~80 minutes total runtime.

IOWs, 75 minutes on a spinning SATA drive to delete 12.8M directory
entries is entirely reasonable for any filesystem, because the
workload is single threaded, blocks on each read IO, requires hundreds
of thousands of read IOs to complete, and the IO time is largely
dependent on the physical layout of the directory and inode structure.

That's the reason I asked for iowatcher to be run to analyse these bug
reports - it will tell us /exactly/ how much time is being spent
waiting for IO, what the seek patterns are, and that will tell us how
badly we all suck at laying out large directories.

> On the bright side, Kevin Vigor was kind enough to run tests showing
> that on some large, fast NVMe devices, everything looks linear:
> https://docs.google.com/spreadsheets/d/1bPi8MTvSP4xzzuARPOd5fxFujhBoU2Dxandr-Vh1T9c/edit#gid=0

Of course. That's because the read IO latency is deterministic, not
dependent on the physical layout (i.e. constant "seek" time), and it's
much, much faster than a spinning disk - probably around 100
microseconds. In that case, the IO wait time drops to:

	read IO delay = 425,600 * 0.0001s = 42.6s

That's not unreasonable given the above numbers from my test VM
showing roughly 2 minutes of wait time in 25 minutes for 10 million
inodes being removed.

IOWs, the performance on fast SSDs and NVMe drives is determined
almost entirely by software algorithms and the per-operation CPU cost,
rather than the physical layout of the structure on disk and the time
needed to retrieve it into memory. That's why you see variable and
slow performance on slow spinning drives, but it gets more and more
linear as the devices get faster and less sensitive to the physical
layout of the filesystem metadata....

Cheers,

Dave.
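PS: here's a quick back-of-envelope calculator for the model above, in
case anyone wants to plug in their own numbers. The batching factors
(32 dirents per getdents IO, ~500 inodes per inode cluster read), the
seek times and the per-inode CPU cost are just the assumptions used in
this mail, not measured constants - adjust them to match your
filesystem geometry and hardware.

#!/usr/bin/env python3
# Back-of-envelope model of a cache-cold, single-threaded unlink
# workload: blocking read IO time plus per-inode CPU time.
# The constants below are the assumptions from the mail above,
# not measured values - treat this as a sketch, not a benchmark.

def estimate(nfiles, seek_s, cpu_s_per_inode,
             dirents_per_io=32, inodes_per_io=500):
    read_ios = nfiles / dirents_per_io + nfiles / inodes_per_io
    io_wait = read_ios * seek_s             # blocks on every read IO
    cpu = nfiles * cpu_s_per_inode          # system CPU for unlink()
    return read_ios, io_wait, cpu

# 12.8M entries, ~141us of CPU per inode (1410s sys CPU / 10M inodes
# from the unlink numbers above)
for name, seek in [("sata, 7ms avg seek", 0.007), ("nvme, 100us", 0.0001)]:
    ios, io_wait, cpu = estimate(12.8e6, seek, 1410 / 10e6)
    print(f"{name}: {ios:,.0f} read IOs, "
          f"{io_wait/60:.0f}m IO wait + {cpu/60:.0f}m CPU "
          f"= ~{(io_wait + cpu)/60:.0f}m total")

Plugging in the 3ms best-case seek instead of 7ms gives the 21-minute
IO floor quoted above; the NVMe case shows the runtime collapsing to
little more than the CPU time.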
-- 
Dave Chinner
david@xxxxxxxxxxxxx