On Thu, May 19, 2022 at 08:36:06AM +1000, Chris Dunlop wrote:
> On Wed, May 18, 2022 at 08:59:00AM -0700, Darrick J. Wong wrote:
> > On Wed, May 18, 2022 at 05:07:13PM +1000, Chris Dunlop wrote:
> > > Oh, sorry... on linux v5.15.34
> > >
> > > On Wed, May 18, 2022 at 04:59:49PM +1000, Chris Dunlop wrote:
> > > > I have an fstrim that's been running for over 48 hours on a 256T
> > > > thin provisioned XFS fs containing around 55T of actual data on a
> > > > slow subsystem (ceph 8,3 erasure-encoded rbd). I don't think there
> > > > would be an enormous amount of data to trim, maybe a few T, but
> > > > I've no idea how long it might be expected to take. In an attempt
> > > > to see what the fstrim was doing, I ran an strace on it. The strace
> > > > has been sitting there without output and unkillable since then,
> > > > now 5+ hours ago. Since the strace, on that same filesystem I now
> > > > have 123 df processes and 615 rm processes -- and growing -- that
> > > > are blocked in xfs_inodegc_flush, e.g.:
> ...
> > It looks like the storage device is stalled on the discard, and most
> > everything else is stuck waiting for buffer locks? The statfs threads
> > are the same symptom as last time.
>
> Note: the box has been rebooted and it's back to normal after an anxious
> 30 minutes waiting for the mount recovery. (Not an entirely wasted 30
> minutes - what a thrilling stage of the Giro d'Italia!)
>
> I'm not sure if the fstrim was stalled, unless the strace had stalled it
> somehow: it had been running for ~48 hours without apparent issues
> before the strace was attached, and then it was another hour before the
> first process stuck on xfs_inodegc_flush appeared.

I suspect that it's just that your storage device is really slow at small
trims. If you didn't set a minimum trim size, XFS will issue discards on
every free space in its trees. If you have fragmented free space (quite
possible if you're using reflink and removing files that have been
reflinked and modified) then you could have millions of tiny free spaces
that XFS is now asking the storage to free.

Dumping the free space histogram of the filesystem will tell us just how
much work you asked the storage to do. e.g.:

# xfs_spaceman -c "freesp" /
   from      to extents   blocks    pct
      1       1   20406    20406   0.03
      2       3   14974    35666   0.06
      4       7   11773    61576   0.10
      8      15   11935   131561   0.22
     16      31   15428   359009   0.60
     32      63   13594   620194   1.04
     64     127   15354  1415541   2.38
    128     255   19269  3560215   5.98
    256     511     975   355811   0.60
    512    1023     831   610381   1.02
   1024    2047     398   580983   0.98
   2048    4095     275   827636   1.39
   4096    8191     156   911802   1.53
   8192   16383      90  1051443   1.77
  16384   32767      54  1257999   2.11
  32768   65535      17   813203   1.37
  65536  131071      13  1331349   2.24
 131072  262143      18  3501547   5.88
 262144  524287       8  2834877   4.76
 524288 1048575       8  5722448   9.61
1048576 2097151       6  9189190  15.43
2097152 4194303       4 14026658  23.55
4194304 8388607       2 10348013  17.37
#

So on this 1TB filesystem, there's ~125,000 free space extents and the
vast majority of them are less than 255 blocks in length (1MB). Hence if
I run fstrim on this filesystem without a minimum size limit, it will
issue roughly 125,000 discard requests. If I set a 1MB minimum size, it
will issue discards on all free spaces 256 blocks or larger. i.e. it will
only issue ~2000 discards and that will cover ~92% of the free space in
the filesystem....
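For example, something along these lines (using the same filesystem and
the 1MB threshold from the histogram above - adjust both to taste) would
restrict the trim to the larger free extents:

# fstrim -v -m 1M /

The -m/--minimum option tells fstrim to skip free extents smaller than
the given size, and -v reports how much was actually trimmed.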
> The open question is what caused the stuck processes?

Oh, that's the easy bit to explain: discard runs with the AGF locked
because it is iterating the free space tree directly. Hence operations on
that AG are blocked until all the free space in that AG has been
discarded. Could be smarter, never needed to be smarter.

Now inodegc comes along, and tries to free an inode in that AG, and
blocks getting the AGF lock during the inode free operation (likely inode
chunk freeing or finobt block allocation). Everything then backs up on
inodegc flushes, which is backed up on discard operations....

> It's now no mystery to me why the fstrim was taking so long, nor why
> the strace didn't produce any output: it turns out fstrim, without an
> explicit --offset --length range, issues a single ioctl() to trim from
> the start of the device to the end, and without an explicit --minimum,
> uses /sys/block/xxx/queue/discard_granularity as the minimum block size
> to discard, in this case 64kB. So it would have been issuing a metric
> shit-ton of discard requests to the underlying storage, something close
> to:
>
>   (fs-size - fs-used) / discard-size
>   (256T - 26T) / 64k
>   3,858,759,680 requests

Won't be anywhere near that number - free space in a 256TB filesystem
with only 29TB used will have lots of really large contiguous free
spaces. Those will get broken down into max discard length chunks, not
minimum. Of course, if the bdev is setting a really small max discard
size, then that's going to be just as big a problem for you....

> It was after figuring out all that that I hit the reset.

Yup, see above for how to actually determine what minimum size to set
for a trim....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx