On Wed, May 18, 2022 at 08:59:00AM -0700, Darrick J. Wong wrote:
On Wed, May 18, 2022 at 05:07:13PM +1000, Chris Dunlop wrote:
Oh, sorry... on linux v5.15.34
On Wed, May 18, 2022 at 04:59:49PM +1000, Chris Dunlop wrote:
I have an fstrim that's been running for over 48 hours on a 256T thin
provisioned XFS fs containing around 55T of actual data on a slow
subsystem (ceph 8,3 erasure-coded rbd). I don't think there would be an
enormous amount of data to trim, maybe a few T, but I've no idea how long
it might be expected to take. In an attempt to see what the fstrim was
doing, I ran an strace on it. The strace has been sitting there without
output, and unkillable, for 5+ hours now. Since the strace, on that same
filesystem I now have 123 df
processes and 615 rm processes -- and growing -- that are blocked in
xfs_inodegc_flush, e.g.:
...
It looks like the storage device is stalled on the discard, and most
everything else is stuck waiting for buffer locks? The statfs threads
are the same symptom as last time.
Note: the box has been rebooted and it's back to normal after an anxious
30 minutes waiting for the mount recovery. (Not an entirely wasted 30
minutes - what a thrilling stage of the Giro d'Italia!)
I'm not sure if the fstrim was stalled, unless the strace had stalled it
somehow: it had been running for ~48 hours without apparent issues before
the strace was attached, and then it was another hour before the first
process stuck on xfs_inodegc_flush appeared.
The open question is: what caused the stuck processes? It's possible the
strace was involved: the stuck process with the earliest start time, a
"df", was started an hour after the strace and it's entirely plausible
that was the very first df or rm issued after the strace. However it's
also plausible that was a coincidence and the strace had nothing to do
with it. Indeed it's even plausible the fstrim had nothing to do with the
stuck processes and there's something else entirely going on: I don't know
if there's a ticking time bomb somewhere in the system.
It's now no mystery to me why the fstrim was taking so long, nor why the
strace didn't produce any output: it turns out fstrim, without an explicit
--offset --length range, issues a single ioctl() to trim from the start of
the device to the end, and without an explicit --minimum, uses
/sys/block/xxx/queue/discard_granularity as the minimum block size to
discard, in this case 64kB. So it would have been issuing a metric
shit-ton of discard requests to the underlying storage, something close
to:
(fs-size - fs-used) / discard-size
(256T - 26T) / 64k
3,858,759,680 requests
It was after figuring out all that that I hit the reset.
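For my own notes, here's roughly what that bare "fstrim <mountpoint>"
amounts to, as far as I can tell: one FITRIM ioctl from the start of the
filesystem to the end, with the effective minimum extent ending up at the
device's discard_granularity (64k here). A minimal, untested sketch - the
mount point is hypothetical and the sizes are just this system's values:

#include <stdio.h>
#include <fcntl.h>
#include <limits.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FITRIM, struct fstrim_range */

int main(void)
{
	struct fstrim_range r = {
		.start  = 0,
		.len    = ULLONG_MAX,	/* i.e. "to the end of the fs" */
		.minlen = 64 * 1024,	/* /sys/block/xxx/queue/discard_granularity */
	};
	int fd = open("/mnt/bigfs", O_RDONLY);	/* hypothetical mount point */

	if (fd < 0 || ioctl(fd, FITRIM, &r) < 0) {
		perror("FITRIM");
		return 1;
	}
	/* on return the kernel has set r.len to the bytes actually trimmed */
	printf("%llu bytes trimmed\n", (unsigned long long)r.len);
	close(fd);
	return 0;
}

A single ioctl covering the whole 256T also explains why the strace had
nothing to show.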
Note: it turns out the actual used space per the filesystem is 26T, whilst
the underlying storage shows 55T used, i.e. there's 29T of real discards
to process. With this ceph rbd storage I don't know if a "real" discard
takes any more or less time than a discard to already-unoccupied storage.
Next time I'll issue the fstrim in much smaller increments, perhaps 128G
at first, and use a --minimum that matches the underlying rbd object size
(4MB). Then play around and monitor it to work out what parameters work
best for this system.
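In other words, something along these lines (untested sketch, equivalent
to looping fstrim --offset/--length/--minimum; the chunk size and mount
point are placeholders):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FITRIM, struct fstrim_range */

int main(void)
{
	const unsigned long long chunk   = 128ULL << 30;	/* 128G per call */
	const unsigned long long fs_size = 256ULL << 40;	/* 256T filesystem */
	int fd = open("/mnt/bigfs", O_RDONLY);	/* hypothetical mount point */

	if (fd < 0)
		return 1;

	for (unsigned long long off = 0; off < fs_size; off += chunk) {
		struct fstrim_range r = {
			.start  = off,
			.len    = chunk,
			.minlen = 4ULL << 20,	/* 4M rbd object size */
		};
		if (ioctl(fd, FITRIM, &r) < 0) {
			perror("FITRIM");
			break;
		}
		/* r.len now holds the bytes trimmed in this chunk */
		printf("offset %lluG: %llu bytes trimmed\n",
		       off >> 30, (unsigned long long)r.len);
	}
	close(fd);
	return 0;
}

Each call should return in bounded time, so it's easy to pause, watch the
ceph side, and tune the chunk size between runs.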
Cheers,
Chris - older, wiser, a little more sleep deprived