On Wed, Jul 12, 2023 at 08:21:11AM +1000, Dave Chinner wrote:
On Tue, Jul 11, 2023 at 05:05:30PM +1000, Chris Dunlop wrote:
On Tue, Jul 11, 2023 at 01:10:11PM +1000, Dave Chinner wrote:
More on the problem topic below, and apologies for the aside, but
this is my immediate concern: post the reboot into v5.15.118 I have
one of the filesystems failing to mount with:
...
I would suggest that this indicates a torn write of some kind. Given
the state of the system when you rebooted, and all the RAID 5/6
writes in progress, this is entirely possible...
...
It's probably safest to just zero the log and run repair at this
point.
Thanks. I did that last night (after taking a snapshot) - 'xfs_repair
-v' is still running but there's a LOT of nasty output so it's not
looking good.
...oh. It finished whilst I've been writing this. If you're interested
in the log:
https://file.io/XOGokgxgttEX
The directory structure looks sane, I'll start running checks on the
data.
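(For the archives, "zero the log and run repair" translates to something
like the below - the VG/LV names are illustrative, not the real ones:)

  # snapshot first so there's a way back (size/args per local setup)
  lvcreate --snapshot --size 200G --name mylv_prerepair myvg/mylv
  # -L zeroes the dirty log before repairing, -v for verbose output
  xfs_repair -L -v /dev/myvg/mylv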
I don't see anything indicating a filesystem problem here. This
looks like a massively overloaded RAID 5 volume. i.e. the fast
storage that makes up the write cache has filled, and now everything
is stuck waiting for the fast cache to drain to the slow backing
store before new writes can be made to the fast cache. IOWs,
everything is running at RAID5 write speed and there's a huge
backlog of data being written to the RAID 5 volume(s) keeping them
100% busy.
Oddly, of the 56 similarly configured filesystems (writecache/lvm/rbd)
on this box, with maybe 10 actively sinking writes at any time, that
one above is the only one that had any trouble on reboot. If it had
been a general problem w/ the cache device I would have thought more
of the active writers would have had similar issues. Maybe I just got
lucky - and/or that demonstrates how hard xfs tries to keep your data
sane.
This had never been an issue with v5.15. Is it possible the upgrade
to v6.1 had a hand in this, or is that probably just coincidence?
It could be a dm-writecache or md raid regression, but it could be
just "luck".
Ugh. Sigh. I guess at some point I'm going to have to bite the bullet
again, and next time watch the cache device like a hawk. I'll keep an
eye out for dm-writecache and md raid problems and patches etc. to see
what might come up.
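(The current plan for "watching it like a hawk" is along these lines -
the device and dm names are made up, and the status fields are from my
reading of the dm-writecache docs:)

  # writecache fill level: the status line reports error count, total
  # blocks, free blocks and blocks under writeback (newer kernels add
  # hit/miss counters as well)
  dmsetup status myvg-mylv
  # latency, queue depth and utilisation on the cache device itself
  iostat -xm 5 nvme0n1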
Is there anything else that occurs to you that I might monitor prior
to and during any future recurrence of the problem?
In particular, could "5e672cd69f0a xfs: non-blocking inodegc
pushes" cause a significantly greater write load on the cache?
No.
I note that there's one fs (separate to the corrupt one above)
that's still in mount recovery since the boot some hours ago. On
past experience that indicates the inodegc stuff is holding things
up, i.e. it would have been running prior to the reboot - or at
least trying to.
I only found one inodegc worker running in the trace above - it was
blocked on the writecache doing an rmapbt block read while removing
extents. It
may have been a long running cleanup, but it may not have been.
Probably was a cleanup: that's the reason for the update to v6.1,
every now and again the box was running into the problem of getting
blocked on the cleanup. The extent of the problem is significantly
reduced by moving from one large fs where any extended cleanup would
block everything, to multiple small-ish fs'es (500G-15T) where the
blast radius of an extended cleanup is far more constrained. But the
problem was still pretty annoying when it hit.
Hmmm, maybe I can just carry a local backport of "non-blocking inodegc
pushes" in my local v5.15. That would push back my need to move do
v6.1.
Or could / should it be considered for an official backport? Looks
like it applies cleanly to current v5.15.120.
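("Applies cleanly" meaning a quick test along the lines of the below
against the stable tree - branch name is arbitrary:)

  git checkout -b xfs-inodegc-5.15 v5.15.120
  # -x records the upstream commit id in the backported commit message
  git cherry-pick -x 5e672cd69f0a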
As it is, mount taking a long time because there's an inode with a
huge number of extents that need freeing on an unlinked list is
*not* related to background inodegc in any way - this has -always-
happened with XFS; it's simply what unlinked inode recovery does.
i.e. background inodegc just moved the extent freeing from syscall
exit context to an async worker thread; the filesystem still has
exactly the same work to do. If the system goes down while that work
is in progress, log recovery has always finished off that cleanup
work...
Gotcha. I'd mistakenly thought "non-blocking inodegc pushes" queued up
the garbage collection for background processing. My further mistaken
hand-wavy thought was that, if the processing that was previously
foreground was now punted to background (perhaps with different
priorities) maybe the background processing was simply way more
efficient, enough to swamp the cache with metadata updates.
But your further explanation, and actually reading the patch (which I
should have done first), shows the gc was already queued; the change
was to NOT wait for the queue to be flushed.
Hmmm, then again... might it be possible that, without the patch, at
some point after a large delete, further work was blocked whilst
waiting for the queue to be flushed, limiting the total amount of
work, but with the patch, further work (e.g. more deletes) can continue
to be queued - possibly to the point of swamping the cache device?
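(If I want to see that rather than guess at it, I suppose the inodegc
tracepoints would show the queue / worker activity - something like the
below, assuming those events are present on this kernel:)

  # capture inodegc queueing and worker activity for a minute
  trace-cmd record -e xfs:xfs_inodegc_queue -e xfs:xfs_inodegc_worker sleep 60
  trace-cmd report | less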
There's been no change to the cache device over the reboot, and it
currently looks busy, but it doesn't look completely swamped:
....
About 150 1MB sized reads, and about 1000-1500 much smaller
writes each second, with an average write wait of near 10ms.
Certainly not a fast device, and it's running at about 50%
utilisation with an average queue depth of 10-15 IOs.
That's hopefully the cache working as intended: sinking small
continuous writes (network data uploads) and aggregating them into
larger blocks to flush out to the bulk storage (i.e. reads from the
cache to write to the bulk).
It should be handling about the same load as prior to the reboot.
If that's the typical sustained load, I wouldn't expect it to have
that much extra overhead given small writes on RAID 6 volumes have
the worst performance characteristics possible. If the write cache
starts flushing lots of small discontiguous writes, I'd expect to
see that device go to 100% utilisation and long write wait times for
extended periods...
There shouldn't be many small discontiguous writes in the data: it's
basically network uploads to sequential files in the 100s MB to multi
GB range. But there's also a bunch of reflinking, not to mention
occasionally removing highly reflinked multi-TB files, so these
metadata updates might count as "lots of small discontiguous writes"?
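(If it helps quantify that, a rough way to gauge how much extent work
one of those unlinks implies - the path is illustrative:)

  # count the extents (shared ones are flagged) in a big reflinked file;
  # unlinking it means roughly that many refcount / rmap / extent-free
  # updates
  xfs_io -r -c "fiemap -v" /srv/uploads/big-reflinked-file | wc -l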
Thanks for your help,
Chris