On Mon, May 09, 2022 at 12:46:59PM +1000, Chris Dunlop wrote:
> Hi,
>
> Is it to be expected that removing 29TB of highly reflinked and
> fragmented data could take days, the entire time blocking other tasks
> like "rm" and "df" on the same filesystem?
>
> - is there any way to see progress and make an eta estimate?
> - would 4700+ processes blocked on the same fs slow things down or
>   they're not relevant?
> - with a reboot/remount, does the log replay continue from where it
>   left off, or start again?
> - is there anything that might improve the situation in newer kernels?
>
> Some details:
>
> # uname -r
> 5.15.34-otn-00007-g6bff5dd37abb
>
> # xfs_info /chroot
> meta-data=/dev/mapper/vg00-chroot isize=512    agcount=282, agsize=244184192 blks
>          =                        sectsz=4096  attr=2, projid32bit=1
>          =                        crc=1        finobt=1, sparse=1, rmapbt=1
>          =                        reflink=1    bigtime=0 inobtcount=0
> data     =                        bsize=4096   blocks=68719475712, imaxpct=1
>          =                        sunit=128    swidth=512 blks
> naming   =version 2               bsize=4096   ascii-ci=0, ftype=1
> log      =internal log            bsize=4096   blocks=521728, version=2
>          =                        sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none                    extsz=4096   blocks=0, rtextents=0
>
> The fs is sitting on lvm and the underlying block device is a ceph 8+3
> erasure-coded rbd.

Ok, so -slow- storage.

> This fs was originally 30T and has been expanded to 256T:
>
> Limits to growth
> https://www.spinics.net/lists/linux-xfs/msg60451.html
>
> And it's been the topic of a few other issues:
>
> Extreme fragmentation ho!
> https://www.spinics.net/lists/linux-xfs/msg47707.html
>
> Mysterious ENOSPC
> https://www.spinics.net/lists/linux-xfs/msg55446.html
>
> The story...
>
> I did an "rm -rf" of a directory containing a "du"-indicated 29TB
> spread over maybe 50 files. The data would have been highly reflinked
> and fragmented. A large part of the reflinking would be to files
> outside the dir in question, and I imagine maybe only 2-3TB of data
> would actually be freed by the "rm".

But it's still got to clean up 29TB of shared extent references.
Assuming worst case reflink extent fragmentation of 4kB filesystem
blocks, 29TB is roughly 7 *billion* references that have to be cleaned
up. TANSTAAFL.

A single CPU running flat out that isn't limited by log space or
metadata writeback can unshare and free about 80-100k extents a second.
If you are either RAM or IO bound, then you might only be able to remove
a few thousand a second, as throughput is limited by the random metadata
writeback rate needed to free journal space for the next transaction.

OOOOOOH.

>          =                        crc=1        finobt=1, sparse=1, rmapbt=1

You have rmapbt enabled. That significantly increases the amount of
metadata every reflink operation has to update. For every shared extent
that is removed, the rmap index has to be updated to remove that
back-reference from the rmapbt.

When there is heavy extent sharing occurring, the rmaps get *large*.
e.g. share a single block 100 times and there is one refcount record for
the 100 references, but there are 100 rmap records - one for each of the
100 references to the block. IOWs, reflink is efficiently recording
reference counts, but rmap cannot, as every owner/offset tuple is
different for each shared extent owner.

Likely that means the maximum sustainable unshare rate is going to be
halved - so maybe a max of 40-50k reference removals/s.

> The "rm" completed in less than a second, but an immediately following
> "df" on that fs didn't return, even 60+ hours later(!!).

At 10k refs/s removed, that's 3hrs/100M or 30hrs/billion references.
At 1k refs/s removed, that's 3.6 million refs an hour, or ~30hrs/100M refs.
At 100 refs/s removed, that's 360k refs an hour, or ~300hrs/100M refs.

And it's entirely possible you asked the filesystem to remove a billion
shared references.....
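If you want to sanity check those numbers yourself, here's the same
arithmetic as a quick shell snippet. It's a sketch only - the 29TB size,
the 4kB worst-case granularity and the removal rates are the rough
estimates from this thread, not measured values:

#!/bin/bash
# Back-of-envelope numbers for the rates quoted above. All inputs are
# rough estimates, not measurements.

# Worst case: 29TB of data reflinked at 4kB filesystem block granularity.
refs=$(( 29 * 1024 * 1024 * 1024 * 1024 / 4096 ))
echo "worst-case shared extent references: ${refs}"    # ~7.8 billion

# Time to retire one billion references at a few plausible removal rates.
for rate in 10000 1000 100; do
	secs=$(( 1000000000 / rate ))
	printf "at %6d refs/s: ~%d hours per billion refs\n" \
		"${rate}" "$(( secs / 3600 ))"
done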
> In the meantime, other background processes were also attempting to do
> various "rm"s (small amounts of non-reflinked data) and "df"s on the
> same fs which were also hanging - building up to a load average of
> 4700+(!!) over

Nothing wrong with a load average of 4700 - it's common to see that sort
of thing when there are lots of IO operations in flight. All it is
telling you is that the system has a massive IO backlog.

> time. A flame graph showed the kernel with 75% idle and 3.75% xfsaild
> as the largest users, and the rest a wide variety of other
> uninteresting stuff. A "sysrq w" showed the tasks blocking in
> xfs_inodegc_flush, e.g.:

Yeah, I've been considering removing that blocking flush for reasons
like this. That's not your problem, though. i.e. if the unlink hadn't
been put in the background, you'd be complaining that rm had hung, the
system is overloaded, IO bound and you can't kill the rm. And then after
reboot, log recovery would have caused the same problem....

So what you see really has nothing to do with background inodegc at
all...

> Iostat showed consistent high %util and high latencies. E.g. a
> combined load average and iostat output at 66 hours after the initial
> "df" started:
>
>     time   load %user %nice %sys %iow %stl %idle  dev rrqm/s wrqm/s r/s  w/s rkB/s  wkB/s arq-sz aqu-sz  await  rwait wwait %util
> 07:55:32 4772.3   0.0   0.0  0.2  0.8  0.0  98.8 rbd0    0.0    0.0 0.5 27.9  1.80 530.20  37.46  25.87 911.07  34.67 925.2  97.0
> 07:55:52 4772.1   0.0   0.0  0.2  0.0  0.0  99.7 rbd0    0.0    0.0 0.9 17.2  3.60 548.20  60.97   7.64 422.22  11.83 443.7  99.2
> 07:56:12 4772.1   0.0   0.0  0.3  1.9  0.0  97.6 rbd0    0.0    0.5 0.2 46.9  1.00 513.00  21.80  46.06 976.91 304.40 980.5  96.3
> 07:56:32 4773.1   0.0   0.0  0.3  1.7  0.0  97.8 rbd0    0.0    0.0 0.5 12.9  1.80 306.20  45.97   6.81 508.09   4.67 525.6  57.8
> 07:56:52 4773.8   0.0   0.0  0.3  1.4  0.0  98.0 rbd0    0.0    0.1 0.7 40.4  2.60 591.80  28.96  32.70 796.60 163.23 806.8  92.2
> 07:57:12 4774.4   0.1   0.0  0.3  1.8  0.0  97.7 rbd0    0.0    0.3 0.2 43.1  0.80 541.80  25.06  42.13 973.05  15.25 977.5  84.0
> 07:57:32 4774.8   0.2   0.0  0.7  2.3  0.0  96.5 rbd0    0.0    0.1 0.5 35.2  2.00 496.40  27.92  30.25 847.35   9.00 859.3  85.7
> 07:57:52 4775.4   0.0   0.0  0.4  1.6  0.0  97.7 rbd0    0.0    0.2 0.7 45.1  2.80 510.80  22.43  42.88 936.33  76.21 949.7  73.6

Yup, the system is IO bound on 4kB random write IO, managing a healthy
total of about _200 IOPS_ and less than 4MB/s to disk. In short, I don't
think your system is even managing 100 reference removals a second -
maybe 50 refs/s are being removed, because it is completely IO bound on
dirty metadata writeback...

IOWs, the problem here is that you asked the filesystem to perform
*billions* of update operations by running that rm -rf command and your
storage simply isn't up to performing such operations. What reflink
giveth you, reflink taketh away.

> Apart from the iostat there was no other indication any progress was
> being made.
>
> Eventually I thought that perhaps the excessive numbers of blocked
> processes might be adding to the woes by causing thundering herd
> problems and the like, and decided to reboot the box as the only way
> to remove those tasks, with the understanding the journal would need
> to be replayed before the fs mount would complete.

Nope, I'd put my house on it not completing journal replay....
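While a mount is stuck in log recovery like this, about the only cheap
way to confirm the filesystem is still doing work rather than being
wedged is to watch the global XFS stats counters. A rough sketch only -
the file is a flat list of per-subsystem counters and I wouldn't rely on
its exact layout being stable across kernels:

# Snapshot the global XFS counters twice, a minute apart; if numbers on
# lines like trans/log/push_ail keep climbing, recovery is still making
# forward progress through the backlog.
cat /proc/fs/xfs/stat > /tmp/xfs-stat.1
sleep 60
cat /proc/fs/xfs/stat > /tmp/xfs-stat.2
diff /tmp/xfs-stat.1 /tmp/xfs-stat.2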
> There were a few more reboots and mount(-attempts) over the next day
> as I was messing around with things - turning off the background
> processes causing the ever-increasing loads, taking lvm snapshots,
> upgrading to the latest stable version of the kernel etc.
>
> I'm now wondering if each mount continues from whence the previous
> mount(-attempt) left off, or does it start processing the replay log
> again from the very beginning?

.... because log recovery runs garbage collection of unlinked inodes,
which was what was running when you rebooted the system. Only now, you
can't access the filesystem at all because it won't complete the mount
process....

> It's now been 40 hours since the most recent reboot/mount and the
> mount process is still running. The load average is a more reasonable
> <10, and the iostat is similar to above.

Yup, because there are only the background inodegc workers operating on
the filesystem and you don't have other processes trying to do data IO
on it as it hasn't finished mounting yet. The filesystem is doing the
same amount of work and making the same progress, yet the load average
is only 10. Load average is a very poor measure of the load that is
actually on the system......

> Is there any way to see where the log replay is up to, and from there
> make a judgement on how much longer it might take, e.g. something in
> xfs_db, or some eBPF wizardry?

Not really. Tracepoints can tell you the operations going on, and if you
know enough about the internals of XFS they might tell you how many
extents still remain on the inodes to remove....

> I tried using xfs_logprint but I'm not sure what it's telling me, and
> multiple versions taken 10 minutes apart don't show any differences
> that are obviously "progress" - the only differences are in
> q:0x[[:xdigit:]]+ in the TRANS entries, and in a:0x[[:xdigit:]]+ in
> the INO entries, e.g.:

logprint just dumps the active transactions in the log - the ones that
are waiting on metadata writeback to move the tail of the log forwards
and retire the transaction. What you see here:

> t1:
> TRANS: tid:0xcf6e0775 #items:50029 trans:0xcf6e0775 q:0x563417886560
> ...
> t1+10s:
> TRANS: tid:0xcf6e0775 #items:50029 trans:0xcf6e0775 q:0x55b5d8815560

is the same transaction in the journal. All that changed is the q: and
a: fields, which are memory addresses of logprint's own structures and
have nothing to do with what is in the journal itself. It's largely
logprint debug information.

IOWs, all this tells us is that the tail of the log is not moving very
fast, which tallies with the really slow storage doing random writes and
being unable to move the tail of the log forwards very quickly.

Indeed, let me put that in context - this transaction has 50,000 items
in it. That's going to take at _least_ 10,000 IOs to write back all that
metadata. At maybe 100 IOs/s, just clearing this transaction to free up
the log space (maybe 32MB of journal space out of 2GB) will take at
least 2 minutes.....

And the rest of the journal will have a similar density of items to be
flushed, so 2GB / 32MB = 64 chunks at ~2 minutes each is over two hours.
So even if the unlinks stopped right now, there's a backlog of at least
a couple of hours worth of dirty metadata to flush to disk. Either
unmount or log recovery would take this long to run at that point.

Also, that inode had 1.6 million extents and mapped just under 1TB of
data. At 100 refs/s being removed, that's 16,000 seconds to clean up
that inode. If that is representative of the other files, then
50 x 16,000 seconds is roughly 10 days....
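If you want to refine that estimate, the actual reference removal rate
can be measured by counting XFS tracepoint hits from userspace rather
than inferring it from iostat. This is a sketch under the assumption
that your kernel exposes events along these lines - xfs_bunmap,
xfs_refcount_decrease and xfs_rmap_unmap are the names I'd expect, but
check `perf list 'xfs:*'` on the running kernel before relying on them:

# See which unmap/refcount/rmap tracepoints this kernel actually provides.
perf list 'xfs:*' | grep -Ei 'bunmap|refcount_decrease|rmap_unmap'

# Count the relevant events system-wide over a 60 second window;
# dividing the counts by 60 gives an approximate removals-per-second rate.
perf stat -a \
	-e xfs:xfs_bunmap \
	-e xfs:xfs_refcount_decrease \
	-e xfs:xfs_rmap_unmap \
	-- sleep 60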
At some point, you have to pay the price of creating billions of random,
fine-grained cross references in tens of TBs of data spread across weeks
and months of production. You don't notice the scale of the
cross-referencing because it's taken weeks and months of normal
operations to get there. It's only when you finally have to perform an
operation that needs to iterate all those references that the scale
suddenly becomes apparent. XFS scales to really large numbers without
significant degradation, so people don't notice things like object
counts or cross references until something like this happens.

I don't think there's much we can do at the filesystem level to help you
at this point - the inode output in the transaction dump above indicates
that you haven't been using extent size hints to limit fragmentation or
extent share/COW sizes, so the damage is already present and we can't
really do anything to fix that up.

I'm also not sure that xfs_repair will help at this point because it has
to pull the entire filesystem metadata into memory. I suspect that you
don't have a couple of TB of RAM or really fast SSD swap space to throw
at that problem (might need both!). I've seen repair take 3 weeks and
250GB of RAM to resolve a corrupted 14TB hardlink farm containing a
couple of billion hardlinks on similarly slow RAID 6 storage. I don't
expect repair to be any faster resolving a couple of billion reflink and
rmap references, but I do expect it to require a whole lot more memory
than hardlink resolution. repair will garbage collect the unlinked
inodes without any additional overhead, but the baseline overhead is
likely prohibitive....

Another option might be to get a fast 1-2TB NVMe SSD (or two or three)
and build a cache tier over the top of the ceph rbds using dm-cache.
That way the cache tier can suck up all the small random writes and the
garbage collection will run much faster. When it is done, run fstrim on
the filesystem to clear all the previously written but now free space
from the cache device, then flush the cache device back to the backing
storage. That last flush will likely take a long, long time because it
will all be small random writes, but it should be far fewer IOs than the
filesystem would otherwise need to issue to clean up the billions of
cross references the reflinks created....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx