Re: rm hanging, v6.1.35

On Wed, Jul 12, 2023 at 08:21:11AM +1000, Dave Chinner wrote:
On Tue, Jul 11, 2023 at 05:05:30PM +1000, Chris Dunlop wrote:
On Tue, Jul 11, 2023 at 01:10:11PM +1000, Dave Chinner wrote:

More on the problem topic below, and apologies for the aside, but this is my immediate concern: post the reboot into v5.15.118 I have one of the filesystems failing to mount with:
...
I would suggest that this indicates a torn write of some kind. Given
the state of the system when you rebooted, and all the RAID 5/6
writes in progress, this is entirely possible...
...
It's probably safest to just zero the log and run repair at this
point.

Thanks. I did that last night (after taking a snapshot) - 'xfs_repair -v' is still running but there's a LOT of nasty output so it's not looking good.

...oh. It finished whilst I've been writing this. If you're interested in the log:

https://file.io/XOGokgxgttEX

The directory structure looks sane, I'll start running checks on the data.
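
For the record, the sequence was roughly the following - a sketch only, with made-up volume names and sizes, and the snapshot details obviously depend on the actual writecache/lvm/rbd layering:

  # snapshot first, so the pre-repair state can be revisited if needed
  lvcreate -s -n fsname-prerepair -L 200G vg0/fsname

  # zero the log and repair in one pass; -L forces log zeroing
  # (discarding whatever was left unreplayed) and -v gives the
  # verbose output linked above
  xfs_repair -L -v /dev/vg0/fsname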

I don't see anything indicating a filesystem problem here. This
looks like a massively overloaded RAID 5 volume. i.e. the fast
storage that makes up the write cache has filled, and now everything
is stuck waiting for the fast cache to drain to the slow backing
store before new writes can be made to the fast cache. IOWs,
everything is running at RAID 5 write speed and there's a huge
backlog of data being written to the RAID 5 volume(s) keeping them
100% busy.

Oddly, of the 56 similarly configured filesystems (writecache/lvm/rbd) on this box, with maybe 10 actively sinking writes at any time, that one above is the only one that had any trouble on reboot. If it had been a general problem with the cache device I would have thought more of the active writers would have had similar issues. Maybe I just got lucky - and/or that demonstrates how hard xfs tries to keep your data sane.

This had never been an issue with v5.15. Is it possible the upgrade to v6.1 had a hand in this, or is that probably just coincidence?

It could be a dm-writecache or md raid regression, but it could be
just "luck".

Ugh. Sigh. I guess at some point I'm going to have to bite the bullet again, and next time watch the cache device like a hawk. I'll keep an eye out for dm-writecache and md raid problems and patches etc. to see what might come up.
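
Concretely, this is roughly what I plan to watch next time - a sketch, with a made-up dm device name, and the tracepoint glob is an assumption on my part about what's available in this kernel:

  # how full is the writecache? dmsetup status reports block usage
  # for writecache targets (exact fields vary by kernel version)
  dmsetup status vg0-fsname_wcache

  # background inodegc activity, if the xfs_inodegc_* tracepoints
  # are present
  trace-cmd record -e 'xfs:xfs_inodegc_*' sleep 60
  trace-cmd report | less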

Is there anything else that occurs to you that I might monitor prior to and during any future recurrence of the problem?

In particular, could "5e672cd69f0a xfs: non-blocking inodegc pushes" cause a significantly greater write load on the cache?

No.

I note that there's one fs (separate from the corrupt one above) that's still in mount recovery since the boot some hours ago. On past experience that indicates the inodegc stuff is holding things up, i.e. it would have been running prior to the reboot - or at least trying to.

I only found one inodegc worker running in the trace above - it was
blocked on writecache doing an rmapbt block read while removing
extents. It may have been a long running cleanup, but it may not have
been.

Probably was a cleanup: that's the reason for the update to v6.1, every now and again the box was running into the problem of getting blocked on the cleanup. The extent of the problem is significantly reduced by moving from one large fs, where any extended cleanup would block everything, to multiple small-ish fs'es (500G-15T) where the blast radius of an extended cleanup is far more constrained. But the problem was still pretty annoying when it hit.

Hmmm, maybe I can just carry a local backport of "non-blocking inodegc pushes" in my local v5.15. That would push back my need to move to v6.1.

Or could / should it be considered for an official backport? Looks like it applies cleanly to current v5.15.120.
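
If I end up carrying it locally, the backport itself would presumably just be something like this (untested, and assuming it really does apply as cleanly as it appeared to):

  # on a stable v5.15.y tree; 5e672cd69f0a is the upstream commit above
  git checkout -b v5.15.120-inodegc v5.15.120
  git cherry-pick -x 5e672cd69f0a
  # then rebuild and re-test the kernel as usual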

As it is, mount taking a long time because there's an inode with a
huge number of extents that need freeing on an unlinked list is
*not* related to background inodegc in any way - this has -always-
happened with XFS; it's simply what unlinked inode recovery does.

i.e. background inodegc just moved the extent freeing from syscall
exit context to an async worker thread; the filesystem still has
exactly the same work to do. If the system goes down while that work
is in progress, log recovery has always finished off that cleanup
work...

Gotcha. I'd mistakenly thought "non-blocking inodegc pushes" queued up the garbage collection for background processing. My further mistaken hand-wavy thought was that, if the processing that was previously foreground was now punted to background (perhaps with different priorities), maybe the background processing was simply way more efficient - enough to swamp the cache with metadata updates.

But your further explanation, and actually reading the patch (I should have done that first), show that the gc was already queued; the update was to NOT wait for the queue to be flushed.

Hmmm, then again... might it be possible that, without the patch, at some point after a large delete further work was blocked whilst waiting for the queue to be flushed, limiting the total amount of work, but with the patch that further work (e.g. more deletes) can keep being queued - possibly to the point of swamping the cache device?

There's been no change to the cache device over the reboot, and it
currently looks busy, but it doesn't look completely swamped:
....

About 150 1MB sized reads, and about 1000-1500 much smaller
writes each second, with an average write wait of near 10ms.
Certainly not a fast device, and it's running at about 50%
utilisation with an average queue depth of 10-15 IOs.

That's hopefully the cache working as intended: sinking small continuous writes (network data uploads) and aggregating them into larger blocks to flush out to the bulk storage (i.e. reads from the cache to write to the bulk).

It should be handling about the same load as prior to the reboot.
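
(For reference, figures like those come from watching the device with something like this - device name made up; the relevant columns are r/s, w/s, w_await, aqu-sz and %util, exact names depending on the sysstat version:

  # 5-second extended stats for the cache device
  iostat -xm 5 /dev/dm-12
)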

If that's the typical sustained load, I wouldn't expect it to have
that much extra overhead given small writes on RAID 6 volumes have
the worst performance characteristics possible. If the write cache
starts flushing lots of small discontiguous writes, I'd expect to
see that device go to 100% utilisation and long write wait times for
extended periods...

There shouldn't be many small discontiguous writes in the data: it's basically network uploads to sequential files in the 100s MB to multi GB range. But there's also a bunch of reflinking, not to mention occasionally removing highly reflinked multi-TB files, so maybe those metadata updates count as "lots of small discontiguous writes"?
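
One thing I can check before the next big removal is how many extents such a file is actually carrying, as a rough proxy for the metadata churn the unlink will generate - something like the following (paths made up, obviously):

  # roughly one line per extent (plus a header)
  xfs_bmap /srv/bulk/big-reflinked-file | wc -l

  # or the inode's extent counters as reported by xfs_io
  xfs_io -r -c stat /srv/bulk/big-reflinked-file | grep nextents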


Thanks for your help,

Chris


