On Tue, May 10, 2022 at 12:55:41PM +1000, Chris Dunlop wrote:
> Hi Dave,
>
> On Tue, May 10, 2022 at 09:09:18AM +1000, Dave Chinner wrote:
> > On Mon, May 09, 2022 at 12:46:59PM +1000, Chris Dunlop wrote:
> > > Is it to be expected that removing 29TB of highly reflinked and fragmented
> > > data could take days, the entire time blocking other tasks like "rm" and
> > > "df" on the same filesystem?
> ...
> > At some point, you have to pay the price of creating billions of
> > random fine-grained cross references in tens of TBs of data spread
> > across weeks and months of production. You don't notice the scale of
> > the cross-reference because it's taken weeks and months of normal
> > operations to get there. It's only when you finally have to perform
> > an operation that needs to iterate all those references that the
> > scale suddenly becomes apparent. XFS scales to really large numbers
> > without significant degradation, so people don't notice things like
> > object counts or cross references until something like this
> > happens.
> >
> > I don't think there's much we can do at the filesystem level to help
> > you at this point - the inode output in the transaction dump above
> > indicates that you haven't been using extent size hints to limit
> > fragmentation or extent share/COW sizes, so the damage is already
> > present and we can't really do anything to fix that up.
>
> Thanks for taking the time to provide a detailed and informative
> exposition, it certainly helps me understand what I'm asking of the fs,
> the areas that deserve more attention, and how to approach analyzing
> the situation.
>
> At this point I'm about 3 days from completing copying the data (from a
> snapshot of the troubled fs mounted with 'norecovery') over to a brand new
> fs. Unfortunately the new fs is also rmapbt=1 so I'll go through all the
> copying again (under more controlled circumstances) to get onto a rmapbt=0
> fs (losing the ability to do online repairs whenever that arrives -
> hopefully that won't come back to haunt me).

Hmm. Were most of the stuck processes running xfs_inodegc_flush? Maybe we
should try to switch that to something that stops waiting after 30s, since
most of the (non-fsfreeze) callers don't actually *require* the work to
finish; they're just trying to return accurate space accounting to
userspace.

> Out of interest:
>
> > > - with a reboot/remount, does the log replay continue from where it left
> > > off, or start again?
>
> Sorry, if you provided an answer to this, I didn't understand it.
>
> Basically the question is, if a recovery on mount were going to take 10
> hours, but the box rebooted and fs mounted again at 8 hours, would the
> recovery this time take 2 hours or once again 10 hours?

In theory, yes, it'll restart where it left off, but if 10 seconds go by
and the extent count *hasn't changed*, then yikes, did we spend that
entire time doing refcount btree updates??

--D

> Cheers,
>
> Chris
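
P.S. Roughly the sort of bounded wait I have in mind, purely as a sketch:
wait_for_completion_timeout() and HZ are real kernel interfaces, but the
m_inodegc_idle completion, the xfs_inodegc_queue_all() call, the 30-second
constant, and the -EAGAIN fallback below are illustrative assumptions, not
what xfs_inodegc_flush() actually does today.

/*
 * Hypothetical sketch: let statfs-style callers wait a bounded time for
 * inode inactivation to drain instead of blocking indefinitely behind a
 * large inodegc backlog.
 */
#define XFS_INODEGC_FLUSH_TIMEOUT	(30 * HZ)	/* assumed value */

static int
xfs_inodegc_flush_timeout(
	struct xfs_mount	*mp)
{
	/* kick any queued inactivation work (assumed helper) */
	xfs_inodegc_queue_all(mp);

	/*
	 * Wait for the workers to signal the (assumed) m_inodegc_idle
	 * completion, but give up after 30 seconds so df/statfs merely
	 * report slightly stale counts instead of hanging for hours.
	 */
	if (!wait_for_completion_timeout(&mp->m_inodegc_idle,
					 XFS_INODEGC_FLUSH_TIMEOUT))
		return -EAGAIN;
	return 0;
}

fsfreeze would still need the unbounded flush, since it genuinely has to
see all pending inactivations finished before the fs can be frozen.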