Re: fstrim and strace considered harmful?

On Wed, May 18, 2022 at 08:59:00AM -0700, Darrick J. Wong wrote:
> On Wed, May 18, 2022 at 05:07:13PM +1000, Chris Dunlop wrote:
> > Oh, sorry... on linux v5.15.34
> > 
> > On Wed, May 18, 2022 at 04:59:49PM +1000, Chris Dunlop wrote:
> > > I have an fstrim that's been running for over 48 hours on a 256T thin
> > > provisioned XFS fs containing around 55T of actual data on a slow
> > > subsystem (ceph 8,3 erasure-coded rbd). I don't think there would be
> > > an enormous amount of data to trim, maybe a few T, but I've no idea
> > > how long it might be expected to take. In an attempt to see what the
> > > fstrim was doing, I ran an strace on it. The strace has been sitting
> > > there without output and unkillable since then, now 5+ hours ago.
> > > Since the strace, on that same filesystem I now have 123 df
> > > processes and 615 rm processes -- and growing -- that are blocked in
> > > xfs_inodegc_flush, e.g.:
> > > ...
> 
> It looks like the storage device is stalled on the discard, and most
> everything else is stuck waiting for buffer locks?  The statfs threads
> are the same symptom as last time.

Note: the box has been rebooted and it's back to normal after an anxious 30 minutes waiting for the mount recovery. (Not an entirely wasted 30 minutes - what a thrilling stage of the Giro d'Italia!)

I'm not sure the fstrim itself was stalled (unless the strace somehow stalled it): it had been running for ~48 hours without apparent issues before the strace was attached, and it was another hour after that before the first process stuck in xfs_inodegc_flush appeared.

The open question is: what caused the stuck processes? It's possible the strace was involved: the stuck process with the earliest start time, a "df", was started an hour after the strace, and it's entirely plausible that was the very first df or rm issued after the strace. However, it's also plausible that was a coincidence and the strace had nothing to do with it. Indeed, it's even plausible the fstrim had nothing to do with the stuck processes and there's something else entirely going on: I don't know if there's a ticking time bomb somewhere in the system.

It's now no mystery to me why the fstrim was taking so long, nor why the strace didn't produce any output. It turns out that fstrim, without an explicit --offset/--length range, issues a single ioctl() to trim from the start of the filesystem to the end, and, without an explicit --minimum, uses /sys/block/xxx/queue/discard_granularity as the minimum extent size to discard, in this case 64kB. So it would have been issuing a metric shit-ton of discard requests to the underlying storage, something close to:

  (fs-size - fs-used) / discard-size
  = (256T - 26T) / 64k
  = 3,858,759,680 requests

It was after figuring out all that that I hit the reset.
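
For the record, here's a rough C sketch of my understanding of what a bare "fstrim <mountpoint>" boils down to (not the actual util-linux source, just an approximation): a single FITRIM ioctl covering the whole filesystem, with the minimum extent effectively the device's discard_granularity (64kB here) when --minimum isn't given:

  /* fitrim-sketch.c: approximation of a bare "fstrim <mountpoint>" */
  #include <stdio.h>
  #include <stdint.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/fs.h>             /* FITRIM, struct fstrim_range */

  int main(int argc, char **argv)
  {
      struct fstrim_range range = {
          .start  = 0,
          .len    = UINT64_MAX,     /* whole filesystem */
          .minlen = 64 * 1024,      /* discard_granularity here: 64kB */
      };
      int fd;

      if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0) {
          perror("open");
          return 1;
      }
      /* one ioctl, however long it takes: the kernel walks the free
       * space and issues the discards itself */
      if (ioctl(fd, FITRIM, &range) < 0) {
          perror("FITRIM");
          return 1;
      }
      /* on return, len holds the number of bytes actually trimmed */
      printf("trimmed %llu bytes\n", (unsigned long long)range.len);
      close(fd);
      return 0;
  }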

Note: it turns out the actual used space per the filesystem is 26T, whilst the underlying storage shows 55T used, i.e. there's 29T of real discards to process. With this ceph rbd storage I don't know if a "real" discard takes any more or less time than a discard to already-unoccupied storage. Next time I'll issue the fstrim in much smaller increments, e.g. perhaps 128G at first, and use a --minimum that matches the underlying object size (4MB), then play around and monitor it to work out what parameters work best for this system.
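
In FITRIM terms the plan comes down to something like this sketch, i.e. the equivalent of repeated "fstrim --offset N --length 128G --minimum 4M" calls (the 128G chunk and 4M minimum are just the starting guesses above, to be tuned while watching how long each chunk takes):

  /* trim-in-chunks.c: sketch of trimming a 256T fs in 128G FITRIM chunks */
  #include <stdio.h>
  #include <stdint.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/fs.h>

  int main(int argc, char **argv)
  {
      const uint64_t chunk  = 128ULL << 30;   /* 128G per FITRIM call */
      const uint64_t fs_end = 256ULL << 40;   /* 256T filesystem */
      int fd;

      if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0) {
          perror("open");
          return 1;
      }
      for (uint64_t off = 0; off < fs_end; off += chunk) {
          struct fstrim_range range = {
              .start  = off,
              .len    = chunk,
              .minlen = 4ULL << 20,           /* 4MB rbd object size */
          };

          /* each chunk is its own bounded ioctl, so progress is visible
           * and the job can be stopped between chunks */
          if (ioctl(fd, FITRIM, &range) < 0) {
              perror("FITRIM");
              break;
          }
          printf("offset %llu: trimmed %llu bytes\n",
                 (unsigned long long)off, (unsigned long long)range.len);
      }
      close(fd);
      return 0;
  }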

Cheers,

Chris - older, wiser, a little more sleep deprived


