On Tue, Jul 11, 2023 at 05:05:30PM +1000, Chris Dunlop wrote:
> On Tue, Jul 11, 2023 at 01:10:11PM +1000, Dave Chinner wrote:
> > FYI, Chris: google is classifying your email as spam because it
> > doesn't have any dkim/dmarc authentication on it. You're lucky I
> > even got this, because I know that gmail mostly just drops such
> > email in a black hole now....
>
> Dkim hopefully fixed now.

Looks good.

> More on the problem topic below, and apologies for the aside, but this is my
> immediate concern: post the reboot into v5.15.118 I have one of the
> filesystems failing to mount with:
>
> Jul 11 16:13:10 b2 kernel: XFS (dm-146): Metadata CRC error detected at xfs_allocbt_read_verify+0x12/0xb0 [xfs], xfs_bnobt block 0x10a00f7f8
> Jul 11 16:13:10 b2 kernel: XFS (dm-146): Unmount and run xfs_repair
> Jul 11 16:13:10 b2 kernel: XFS (dm-146): First 128 bytes of corrupted metadata buffer:
> Jul 11 16:13:10 b2 kernel: 00000000: 41 42 33 42 00 00 01 ab 05 0a 78 fe 02 c0 2b ff  AB3B......x...+.
> Jul 11 16:13:10 b2 kernel: 00000010: 00 00 00 01 0a 00 f7 f8 00 00 00 26 00 3b 2d 40  ...........&.;-@
> Jul 11 16:13:10 b2 kernel: 00000020: cf 42 ed 07 78 a1 4e ff 9e 20 e3 d9 6f fc 3e 30  .B..x.N.. ..o.>0
> Jul 11 16:13:10 b2 kernel: 00000030: 00 00 00 02 80 b2 25 41 08 c7 6c 00 00 00 01 28  ......%A..l....(
> Jul 11 16:13:10 b2 kernel: 00000040: 08 c7 6d 3e 00 00 00 c2 08 c7 6f 65 00 00 00 a7  ..m>......oe....
> Jul 11 16:13:10 b2 kernel: 00000050: 08 c7 70 a5 00 00 00 3a 08 c7 71 00 00 00 00 c4  ..p....:..q.....
> Jul 11 16:13:10 b2 kernel: 00000060: 08 c7 71 e8 00 00 00 18 08 c7 72 50 00 00 02 c3  ..q.......rP....
> Jul 11 16:13:10 b2 kernel: 00000070: 08 c7 75 b1 00 00 02 4b 08 c7 78 db 00 00 02 3d  ..u....K..x....=
> Jul 11 16:13:10 b2 kernel: XFS (dm-146): log mount/recovery failed: error -74
> Jul 11 16:13:10 b2 kernel: XFS (dm-146): log mount failed

I would suggest that this indicates a torn write of some kind. Given
the state of the system when you rebooted, and all the RAID 5/6 writes
in progress, this is entirely possible...

> Then xfs_repair comes back with:
>
> # xfs_repair /dev/vg-name/lv-name
> Phase 1 - find and verify superblock...
>         - reporting progress in intervals of 15 minutes
> Phase 2 - using internal log
>         - zero log...
> ERROR: The filesystem has valuable metadata changes in a log which needs to
> be replayed. Mount the filesystem to replay the log, and unmount it before
> re-running xfs_repair. If you are unable to mount the filesystem, then use
> the -L option to destroy the log and attempt a repair.
> Note that destroying the log may cause corruption -- please attempt a mount
> of the filesystem before doing this.
>
> At this point is "xfs_repair -L" (and attendant potential data loss) my only
> option?

One option is to correct the CRC with xfs_db, then try to mount the
filesystem again, but that will simply allow recovery to try to modify
what is possibly a bad btree block and then have other things go really
wrong (e.g. cross-linked data, duplicate freespace) at a later point in
time. That's potentially far more damaging, and I wouldn't try it
without having a viable rollback strategy (e.g. full device snapshot
and/or copy)....

It's probably safest to just zero the log and run repair at this point.
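Roughly, something like this - an untested sketch, with the snapshot
name and size made up and the device path taken from your xfs_repair
output above. Take a rollback point first, sized to whatever CoW space
the VG can spare:

# lvcreate -s -n lv-name-prerepair -L 200G vg-name/lv-name

then zero the log and run repair:

# xfs_repair -L /dev/vg-name/lv-name

If repair leaves things in a worse state, "lvconvert --merge
vg-name/lv-name-prerepair" should roll the origin back to the
snapshot.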
> > This task holds the wc_lock() while it is doing this writeback
> > flush.
> >
> > All the XFS filesystems are stuck in similar ways, such as trying to
> > push the journal and getting stuck in IO submission in
> > dm-writecache:
> ...
> > These are getting into dm-writecache, and the lock they are getting
> > stuck on is the wc_lock().
> ...
> >
> > Directory reads are getting stuck in dm-writecache:
> ...
> > Same lock.
> >
> > File data reads are getting stuck the same way in dm-writecache.
> >
> > File data writes (i.e. writeback) are getting stuck the same way
> > in dm-writecache.
> >
> > Yup, everything is stuck on dm-writecache flushing to the underlying
> > RAID-5 device.
> >
> > I don't see anything indicating a filesystem problem here. This
> > looks like a massively overloaded RAID 5 volume. i.e. the fast
> > storage that makes up the write cache has filled, and now everything
> > is stuck waiting for the fast cache to drain to the slow backing
> > store before new writes can be made to the fast cache. IOWs,
> > everything is running at RAID5 write speed and there's a huge
> > backlog of data being written to the RAID 5 volume(s) keeping them
> > 100% busy.
>
> This had never been an issue with v5.15. Is it possible the upgrade to v6.1
> had a hand in this or that's probably just coincidence?

It could be a dm-writecache or md raid regression, but it could be
just "luck".

> In particular, could "5e672cd69f0a xfs: non-blocking inodegc pushes" cause a
> significantly greater write load on the cache?

No.

> I note that there's one fs (separate to the corrupt one above) that's still
> in mount recovery since the boot some hours ago. On past experience that
> indicates the inodegc stuff is holding things up, i.e. it would have been
> running prior to the reboot - or at least trying to.

I only found one inodegc worker running in the trace above - it was
blocked on the writecache doing a rmapbt block read while removing
extents. It may have been a long running cleanup, but it may not have
been.

As it is, a mount taking a long time because there's an inode with a
huge number of extents that need freeing on an unlinked list is *not*
related to background inodegc in any way - this has -always- happened
with XFS; it's simply what unlinked inode recovery does. i.e.
background inodegc just moved the extent freeing from syscall exit
context to an async worker thread; the filesystem still has exactly
the same work to do. If the system goes down while that work is in
progress, log recovery has always finished off that cleanup work...

> > If there is no IO going to the RAID 5 volumes, then you've got a
> > RAID 5 or physical IO hang of some kind. But I can't find any
> > indication of that - everything looks like it's waiting on RAID 5
> > stripe population to make write progress...
>
> There's been no change to the cache device over the reboot, and it
> currently looks busy, but it doesn't look completely swamped:
....

About 150 1MB sized reads, and about 1000-1500 much smaller writes
each second, with an average write wait of near 10ms. Certainly not a
fast device, and it's running at about 50% utilisation with an average
queue depth of 10-15 IOs.

> It should be handling about the same load as prior to the reboot.

If that's the typical sustained load, I wouldn't expect it to have
that much extra overhead, given small writes on RAID 6 volumes have
the worst performance characteristics possible. If the write cache
starts flushing lots of small discontiguous writes, I'd expect to see
that device go to 100% utilisation and long write wait times for
extended periods...
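FWIW, if you want to confirm it's the cache filling and draining that
everything is throttling on: "dmsetup status" on a writecache target
reports (IIRC, after the error indicator) the total number of blocks,
the number of free blocks and the number of blocks under writeback,
and iostat on the backing md device shows how hard writeback is
pushing it. A rough sketch - the device names below are placeholders
for whatever your setup actually uses:

# dmsetup status vg--name-lv--name_wcache
# iostat -dxm 5 /dev/md0

Free blocks pinned near zero with a large, steady writeback count
while the md device sits at 100% utilisation would confirm everything
is running at RAID 5/6 write speed.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx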