On Tue, Jul 11, 2023 at 05:05:30PM +1000, Chris Dunlop wrote:
> On Tue, Jul 11, 2023 at 01:10:11PM +1000, Dave Chinner wrote:
> > FYI, Chris: google is classifying your email as spam because it
> > doesn't have any dkim/dmarc authentication on it. You're lucky I
> > even got this, because I know that gmail mostly just drops such
> > email in a black hole now....
>
> Dkim hopefully fixed now.

Looks good.

> More on the problem topic below, and apologies for the aside, but this is my
> immediate concern: post the reboot into v5.15.118 I have one of the
> filesystems failing to mount with:
>
> Jul 11 16:13:10 b2 kernel: XFS (dm-146): Metadata CRC error detected at xfs_allocbt_read_verify+0x12/0xb0 [xfs], xfs_bnobt block 0x10a00f7f8
> Jul 11 16:13:10 b2 kernel: XFS (dm-146): Unmount and run xfs_repair
> Jul 11 16:13:10 b2 kernel: XFS (dm-146): First 128 bytes of corrupted metadata buffer:
> Jul 11 16:13:10 b2 kernel: 00000000: 41 42 33 42 00 00 01 ab 05 0a 78 fe 02 c0 2b ff  AB3B......x...+.
> Jul 11 16:13:10 b2 kernel: 00000010: 00 00 00 01 0a 00 f7 f8 00 00 00 26 00 3b 2d 40  ...........&.;-@
> Jul 11 16:13:10 b2 kernel: 00000020: cf 42 ed 07 78 a1 4e ff 9e 20 e3 d9 6f fc 3e 30  .B..x.N.. ..o.>0
> Jul 11 16:13:10 b2 kernel: 00000030: 00 00 00 02 80 b2 25 41 08 c7 6c 00 00 00 01 28  ......%A..l....(
> Jul 11 16:13:10 b2 kernel: 00000040: 08 c7 6d 3e 00 00 00 c2 08 c7 6f 65 00 00 00 a7  ..m>......oe....
> Jul 11 16:13:10 b2 kernel: 00000050: 08 c7 70 a5 00 00 00 3a 08 c7 71 00 00 00 00 c4  ..p....:..q.....
> Jul 11 16:13:10 b2 kernel: 00000060: 08 c7 71 e8 00 00 00 18 08 c7 72 50 00 00 02 c3  ..q.......rP....
> Jul 11 16:13:10 b2 kernel: 00000070: 08 c7 75 b1 00 00 02 4b 08 c7 78 db 00 00 02 3d  ..u....K..x....=
> Jul 11 16:13:10 b2 kernel: XFS (dm-146): log mount/recovery failed: error -74
> Jul 11 16:13:10 b2 kernel: XFS (dm-146): log mount failed

I would suggest that this indicates a torn write of some kind. Given
the state of the system when you rebooted, and all the RAID 5/6 writes
in progress, this is entirely possible...

> Then xfs_repair comes back with:
>
> # xfs_repair /dev/vg-name/lv-name
> Phase 1 - find and verify superblock...
>         - reporting progress in intervals of 15 minutes
> Phase 2 - using internal log
>         - zero log...
> ERROR: The filesystem has valuable metadata changes in a log which needs to
> be replayed. Mount the filesystem to replay the log, and unmount it before
> re-running xfs_repair. If you are unable to mount the filesystem, then use
> the -L option to destroy the log and attempt a repair.
> Note that destroying the log may cause corruption -- please attempt a mount
> of the filesystem before doing this.
>
> At this point is "xfs_repair -L" (and attendant potential data loss) my only
> option?

One option is to correct the CRC with xfs_db, then try to mount the
filesystem again, but that will simply allow recovery to try to modify
what is possibly a bad btree block and then have other things go really
wrong (e.g. cross-linked data, duplicate freespace) at a later point in
time. That's potentially far more damaging, and I wouldn't try it
without having a viable rollback strategy (e.g. full device snapshot
and/or copy)....

It's probably safest to just zero the log and run repair at this point.
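Roughly, something like this - an untested sketch, with the snapshot
name and size made up and the device path taken from your xfs_repair
output above. Take a rollback point first, sized to whatever CoW space
the VG can spare:

# lvcreate -s -n lv-name-prerepair -L 200G vg-name/lv-name

then zero the log and run repair:

# xfs_repair -L /dev/vg-name/lv-name

If repair leaves things in a worse state, "lvconvert --merge
vg-name/lv-name-prerepair" should roll the origin back to the
snapshot.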
> > This task holds the wc_lock() while it is doing this writeback
> > flush.
> >
> > All the XFS filesystems are stuck in similar ways, such as trying to
> > push the journal and getting stuck in IO submission in
> > dm-writecache:
> ...
> > These are getting into dm-writecache, and the lock they are getting
> > stuck on is the wc_lock().
> ...
> >
> > Directory reads are getting stuck in dm-writecache:
> ...
> > Same lock.
> >
> > File data reads are getting stuck the same way in dm-writecache.
> >
> > File data writes (i.e. writeback) are getting stuck the same way
> > in dm-writecache.
> >
> > Yup, everything is stuck on dm-writecache flushing to the underlying
> > RAID-5 device.
> >
> > I don't see anything indicating a filesystem problem here. This
> > looks like a massively overloaded RAID 5 volume. i.e. the fast
> > storage that makes up the write cache has filled, and now everything
> > is stuck waiting for the fast cache to drain to the slow backing
> > store before new writes can be made to the fast cache. IOWs,
> > everything is running at RAID5 write speed and there's a huge
> > backlog of data being written to the RAID 5 volume(s) keeping them
> > 100% busy.
>
> This had never been an issue with v5.15. Is it possible the upgrade to v6.1
> had a hand in this or that's probably just coincidence?

It could be a dm-writecache or md raid regression, but it could be
just "luck".

> In particular, could "5e672cd69f0a xfs: non-blocking inodegc pushes" cause a
> significantly greater write load on the cache?

No.

> I note that there's one fs (separate to the corrupt one above) that's still
> in mount recovery since the boot some hours ago. On past experience that
> indicates the inodegc stuff is holding things up, i.e. it would have been
> running prior to the reboot - or at least trying to.

I only found one inodegc worker running in the trace above - it was
blocked on the writecache doing a rmapbt block read while removing
extents. It may have been a long running cleanup, but it may not have
been.

As it is, a mount taking a long time because there's an inode with a
huge number of extents that need freeing on an unlinked list is *not*
related to background inodegc in any way - this has -always- happened
with XFS; it's simply what unlinked inode recovery does. i.e.
background inodegc just moved the extent freeing from syscall exit
context to an async worker thread; the filesystem still has exactly
the same work to do. If the system goes down while that work is in
progress, log recovery has always finished off that cleanup work...

> > If there is no IO going to the RAID 5 volumes, then you've got a
> > RAID 5 or physical IO hang of some kind. But I can't find any
> > indication of that - everything looks like it's waiting on RAID 5
> > stripe population to make write progress...
>
> There's been no change to the cache device over the reboot, and it
> currently looks busy, but it doesn't look completely swamped:
....

About 150 1MB sized reads, and about 1000-1500 much smaller writes
each second, with an average write wait of near 10ms. Certainly not a
fast device, and it's running at about 50% utilisation with an average
queue depth of 10-15 IOs.

> It should be handling about the same load as prior to the reboot.

If that's the typical sustained load, I wouldn't expect it to have
that much extra overhead, given small writes on RAID 6 volumes have
the worst performance characteristics possible. If the write cache
starts flushing lots of small discontiguous writes, I'd expect to see
that device go to 100% utilisation and long write wait times for
extended periods...
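FWIW, if you want to confirm it's the cache filling and draining that
everything is throttling on: "dmsetup status" on a writecache target
reports (IIRC, after the error indicator) the total number of blocks,
the number of free blocks and the number of blocks under writeback,
and iostat on the backing md device shows how hard writeback is
pushing it. A rough sketch - the device names below are placeholders
for whatever your setup actually uses:

# dmsetup status vg--name-lv--name_wcache
# iostat -dxm 5 /dev/md0

Free blocks pinned near zero with a large, steady writeback count
while the md device sits at 100% utilisation would confirm everything
is running at RAID 5/6 write speed.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx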