On Wed, Jan 03, 2018 at 08:16:58PM -0800, Jim Meyering wrote:
> On Mon, Jan 1, 2018 at 10:22 PM, Theodore Ts'o <tytso@xxxxxxx> wrote:
> > versus the patience needed to recover from
> > accidentally dumping 16 million files into a directory --- I prefer
> > the latter. I can wait a few minutes....
>
> I've just run a test on the spinning-disk file system mentioned above,
> and it took 75 minutes to delete 12.8M entries. That's rather nasty.

An unrealistic performance expectation is not a code bug. It's a
knowledge problem and a learning opportunity. :)

So let's put some numbers to this operation. If this is a cache-cold
rm, then the number of read IOs required on a perfectly laid out /XFS/
filesystem would be roughly:

	read IOs = num getdents IOs + num inode lookup IOs
	         = 12.8M / 32 + 12.8M / 500
	         = 400,000 + 25,600
	         = 425,600

Now, let's assume a low seek time for each IO on a SATA drive - about
3ms per IO. That means the read IO delay is:

	read IO delay = 425,600 * 0.003s = 1277s = 21m17s

Hence with a perfect layout we've got at least 21 minutes of blocking
IO time alone in this workload. If the directory and inodes are
scattered, and we get an average seek time (7ms), then we're looking
at 50 minutes of blocking read IO time in the workload.

Then we've got to add the time for each operation to be executed on
the CPU. If I take the numbers from a 10M file unlink I did yesterday
in a VM with device image files hosted on XFS on NVMe drives (times
are wall, user and system seconds):

	File count: 10000000
	Create:      918.46    47.14    866.24
	Unlink:     1545.99    12.14   1410.97

There's 1410s of system CPU time to unlink 10 million inodes. Add
another 25% for the larger inode count in your test, and we've got
almost 30 minutes of CPU time in this workload. If we add that to our
50 minutes of IO blocking time, we've got ~80 minutes total runtime.

IOWs, 75 minutes on a spinning SATA drive to delete 12.8M directory
entries is entirely reasonable for any filesystem, because the
workload is single threaded, blocks on each read IO, requires hundreds
of thousands of read IOs to complete, and the IO time is largely
dependent on the physical layout of the directory and inode structure.

That's the reason I asked for iowatcher to be run to analyse these bug
reports - it will tell us /exactly/ how much time is being spent
waiting for IO, what the seek patterns are, and that will tell us how
badly we all suck at laying out large directories.

> On the bright side, Kevin Vigor was kind enough to run tests showing
> that on some large, fast NVMe devices, everything looks linear:
> https://docs.google.com/spreadsheets/d/1bPi8MTvSP4xzzuARPOd5fxFujhBoU2Dxandr-Vh1T9c/edit#gid=0

Of course. That's because the read IO latency is deterministic, not
dependent on the physical layout (i.e. constant "seek" time), and it's
much, much faster than a spinning disk - probably around 100
microseconds. In that case, the IO wait time drops to:

	read IO delay = 425,600 * 0.0001s = 42.6s

That's not unreasonable given the above numbers from my test VM
showing roughly 2 minutes of wait time in 25 minutes for 10 million
inodes being removed.

IOWs, the performance on fast SSDs and NVMe drives is determined
almost entirely by software algorithms and the per-operation CPU cost,
rather than the physical layout of the structure on disk and the time
needed to retrieve it into memory. That's why you see variable and
slow performance on slow spinning drives, but it gets more and more
linear as the devices get faster and less sensitive to the physical
layout of the filesystem metadata....

Cheers,

Dave.
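PS: here's a quick back-of-envelope calculator for the model above, in
case anyone wants to plug in their own numbers. The batching factors
(32 dirents per getdents IO, ~500 inodes per inode cluster read), the
seek times and the per-inode CPU cost are just the assumptions used in
this mail, not measured constants - adjust them to match your
filesystem geometry and hardware.

#!/usr/bin/env python3
# Back-of-envelope model of a cache-cold, single-threaded unlink
# workload: blocking read IO time plus per-inode CPU time.
# The constants below are the assumptions from the mail above,
# not measured values - treat this as a sketch, not a benchmark.

def estimate(nfiles, seek_s, cpu_s_per_inode,
             dirents_per_io=32, inodes_per_io=500):
    read_ios = nfiles / dirents_per_io + nfiles / inodes_per_io
    io_wait = read_ios * seek_s             # blocks on every read IO
    cpu = nfiles * cpu_s_per_inode          # system CPU for unlink()
    return read_ios, io_wait, cpu

# 12.8M entries, ~141us of CPU per inode (1410s sys CPU / 10M inodes
# from the unlink numbers above)
for name, seek in [("sata, 7ms avg seek", 0.007), ("nvme, 100us", 0.0001)]:
    ios, io_wait, cpu = estimate(12.8e6, seek, 1410 / 10e6)
    print(f"{name}: {ios:,.0f} read IOs, "
          f"{io_wait/60:.0f}m IO wait + {cpu/60:.0f}m CPU "
          f"= ~{(io_wait + cpu)/60:.0f}m total")

Plugging in the 3ms best-case seek instead of 7ms gives the 21-minute
IO floor quoted above; the NVMe case shows the runtime collapsing to
little more than the CPU time.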
-- 
Dave Chinner
david@xxxxxxxxxxxxx