On Tue, Jan 02, 2018 at 01:21:32AM +0100, Niklas Hambüchen wrote:
> Hello filesystem hackers,
>
> Over in coreutils issue https://bugs.gnu.org/29921 we have found that
> unlink()ing `n` many files seems to take in general O(n²) wall time.
>
> So far me and others (in CC) have benchmarked `rm -r` and other methods
> of unlink()ing multiple files and we've found approximately quadratic
> performance across:
>
> * ext4, XFS and zfs-on-linux on spinning disks
> * ext4 on SSDs
> * the other combinations were not tested yet
>
> Please find the exact numbers at https://bugs.gnu.org/29921.

Run iowatcher on the tests and have a look at the IO patterns before and
after the modifications to rm. Work out where the seek delays are
actually coming from and find the common characteristics that all the
reports have.

> It would be great to hear any kind of insights you may have regarding
> this issue, implementation details that might be relevant here, or even
> some quick benchmarks on your systems to confirm or refute our findings.

So I just ran a quick test here on XFS: create files with fsmark,
unmount, mount, remove them, compare wall times.

Nfiles    create time    remove time
10k       0m0.518s       0m0.658s
20k       0m1.039s       0m1.342s
100k      0m4.540s       0m4.646s
200k      0m8.567s       0m10.180s
1M        1m23.059s      0m51.582s

Looks pretty linear in terms of rm time. This is with rm v8.26, and the
rm command is "rm -rf /mnt/scratch/0", which contains a single directory
with all the files in it.

The typical IO pattern during rm is getdents @ 40 calls/s and 20,000
inode lookup calls/s, i.e. inode lookups outnumber directory reads by
500:1. XFS does inode reads in clusters, so those 20,000 inodes/s after a
sequential create are being serviced by about 650 read IO/s. Not very
much, really.

This is run on XFS on a RAID0 of 2 NVMe drives: a sparse 500TB file
passed to a VM as a virtio,cache=none,aio=native drive, formatted with
XFS (i.e. XFS on XFS).

So, run it on spinning drives. Same VM, just a SATA XFS image file on a
SATA drive on the host used instead:

Nfiles    create time    remove time
10k       0m0.520s       0m0.643s
20k       0m1.198s       0m1.351s
100k      0m5.306s       0m6.768s
200k      0m11.083s      0m13.763s
1M        1m39.876s      1m14.765s

So, pretty much identical, still linear. The difference in performance
between the two was the journal size - the smaller filesystem wrote at
50MB/s continually to the journal, while the large NVMe filesystem hardly
touched it. Remaking the small filesystem with a larger log made the IO
patterns and performance numbers identical to the large filesystem.

Then I locked out 15.5GB of RAM so that the test runs short of memory
and causes writeback to occur during file creation. That barely changed
the performance on either the spinning or NVMe drives.

Just upgraded to coreutils 8.28; performance is basically identical to
8.26.

Oh, I just remembered that I've also got some old Samsung 840 EVO SSDs
attached to that VM! I ran the test again and got the same results. And,
while I'm at it, I ran it on the emulated pmem I have configured, and it
got the same results too.

IOWs, I've got XFS on 4.14 producing near-identical results for spinning
disks, old SATA SSDs, brand new NVMe SSDs and pmem. There simply isn't an
O(n^2) problem in the userspace or XFS code, before or after the changes
pointed out in the bug linked above.
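To put a rough number on "pretty linear", take the NVMe table above:
going from 10k to 1M files is a 100x increase in file count, and the
remove time goes from ~0.66s to ~52s, i.e. roughly 80x. Scaling the 10k
result up gives:

    linear:     0.658s * 100    ~=   66s
    quadratic:  0.658s * 100^2  ~= 6580s  (a bit under two hours)

Measured: 51.6s. That's linear - slightly better, if anything, since the
fixed per-run overheads get amortised over more files.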
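And the 500:1 inode-lookup-to-getdents ratio above falls straight out of
the shape of the removal loop itself. For a single flat directory, what
rm does boils down to something like the sketch below (this is not the
coreutils code - the real rm goes through fts(3) and does far more - just
a minimal illustration of the syscall pattern): one getdents64() call
returns a buffer full of directory entries, but every entry still needs
its own unlinkat(), and each unlink has to look up and modify that file's
inode. Constant work per file, hence linear totals.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Matches the kernel's struct linux_dirent64 layout. */
struct linux_dirent64 {
	uint64_t       d_ino;
	int64_t        d_off;
	unsigned short d_reclen;
	unsigned char  d_type;
	char           d_name[];
};

int main(int argc, char **argv)
{
	char buf[64 * 1024];

	if (argc != 2) {
		fprintf(stderr, "usage: %s <dir>\n", argv[0]);
		return 1;
	}

	int dfd = open(argv[1], O_RDONLY | O_DIRECTORY);
	if (dfd < 0) {
		perror("open");
		return 1;
	}

	for (;;) {
		/* One syscall pulls in a whole buffer of directory entries. */
		long nread = syscall(SYS_getdents64, dfd, buf, sizeof(buf));
		if (nread <= 0)
			break;

		for (long off = 0; off < nread; ) {
			struct linux_dirent64 *d = (void *)(buf + off);

			off += d->d_reclen;
			if (!strcmp(d->d_name, ".") || !strcmp(d->d_name, ".."))
				continue;

			/*
			 * ...but each entry needs its own unlink, and the
			 * filesystem has to find and dirty that file's inode
			 * to drop the link count - hence inode lookups
			 * vastly outnumbering directory reads.
			 */
			if (unlinkat(dfd, d->d_name, 0) < 0)
				perror(d->d_name);
		}
	}

	close(dfd);
	return 0;
}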
Keep in mind that running a random script like this on a random
filesystem in a random initial state is about the most sub-optimal
environment you can make for a regression test. It won't give reliable
results, and you can't even compare results from the same machine and
filesystem because the initial conditions are always different from run
to run.

> Unless instructed otherwise, I would also go ahead and file bugs for the
> respective filesystem types, as quadratic deletion time makes for a bad
> user experience and lots of sysadmin tears.

I think you're being overly melodramatic. There isn't an obvious problem,
just a lot of handwaving. Do some IO analysis that shows where the delays
are coming from, then we'll have something to work from....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx