On Wed, Jul 12, 2023 at 11:13:56AM +1000, Chris Dunlop wrote:
> On Wed, Jul 12, 2023 at 08:21:11AM +1000, Dave Chinner wrote:
> > On Tue, Jul 11, 2023 at 05:05:30PM +1000, Chris Dunlop wrote:
> > > On Tue, Jul 11, 2023 at 01:10:11PM +1000, Dave Chinner wrote:
> > >
> > > More on the problem topic below, and apologies for the aside, but
> > > this is my immediate concern: post the reboot into v5.15.118 I have
> > > one of the filesystems failing to mount with:
> ...
> > I would suggest that this indicates a torn write of some kind. Given
> > the state of the system when you rebooted, and all the RAID 5/6
> > writes in progress, this is entirely possible...
> ...
> > It's probably safest to just zero the log and run repair at this
> > point.
>
> Thanks. I did that last night (after taking a snapshot) - 'xfs_repair -v'
> is still running but there's a LOT of nasty output so it's not looking
> good.
>
> ...oh. It finished whilst I've been writing this. If you're interested
> in the log:
>
> https://file.io/XOGokgxgttEX

Oh, there's *lots* of CRC errors. All through the free space and rmap
btrees, but all in a similar range of LBAs.

$ grep CRC ~/Downloads/xfs_repair.log |sort -k 9
Metadata CRC error detected at 0x5557765dfc6d, xfs_rmapbt block 0x10518c600/0x1000
Metadata CRC error detected at 0x5557765dfc6d, xfs_rmapbt block 0x10518c648/0x1000
Metadata CRC error detected at 0x55577660b07d, xfs_bnobt block 0x10518c650/0x1000
Metadata CRC error detected at 0x55577660b07d, xfs_cntbt block 0x10518c668/0x1000
Metadata CRC error detected at 0x55577660b07d, xfs_bnobt block 0x10518ce88/0x1000
Metadata CRC error detected at 0x5557765dfc6d, xfs_rmapbt block 0x10518d6c0/0x1000
Metadata CRC error detected at 0x55577660b07d, xfs_bnobt block 0x10518d6d0/0x1000
Metadata CRC error detected at 0x5557765dfc6d, xfs_rmapbt block 0x10518d6e0/0x1000
Metadata CRC error detected at 0x5557765dfc6d, xfs_rmapbt block 0x10518d828/0x1000
Metadata CRC error detected at 0x55577660b07d, xfs_bnobt block 0x10518d858/0x1000
.....

There's a whole cluster of CRC errors very close together (10 errors
across ~0x1200 sectors = ~4600 sectors, i.e. a bit over 2MB of disk
space). This pattern of "bad CRC clusters" occurs more often than not
as I look through the list of CRC errors. I suspect this indicates
something more fundamental is wrong with either the RAID volume or the
dm-writecache on top of it...
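
FWIW, picking those clusters out of the repair log is easy enough to
script. A rough sketch - it assumes GNU awk for strtonum(), and the
16384-sector (8MB) gap threshold is just an arbitrary notion of "close
together":

$ grep CRC ~/Downloads/xfs_repair.log | sort -k 9 |
  awk '{
	# field 9 is "<daddr>/<block size>", e.g. 0x10518c600/0x1000
	split($9, a, "/");
	lba = strtonum(a[1]);
	# start a new cluster when the gap to the previous error is
	# more than 16384 sectors (8MB)
	if (NR > 1 && lba - prev > 16384)
		print "----";
	print;
	prev = lba;
  }'

Each "----" line marks a jump of more than 8MB in the sorted daddr
list, so the runs between them are the clusters.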

> The directory structure looks sane, I'll start running checks on the
> data.

Given the amount of bad metadata, I wouldn't trust anything in that
filesystem to be properly intact.

> > > > I don't see anything indicating a filesystem problem here. This
> > > > looks like a massively overloaded RAID 5 volume. i.e. the fast
> > > > storage that makes up the write cache has filled, and now
> > > > everything is stuck waiting for the fast cache to drain to the
> > > > slow backing store before new writes can be made to the fast
> > > > cache. IOWs, everything is running at RAID 5 write speed and
> > > > there's a huge backlog of data being written to the RAID 5
> > > > volume(s) keeping them 100% busy.
>
> Oddly, of the 56 similarly configured filesystems (writecache/lvm/rbd)
> on this box, with maybe 10 actively sinking writes at any time, that
> one above is the only one that had any trouble on reboot. If it had
> been a general problem w/ the cache device I would have thought more
> of the active writers would have similar issues. Maybe I just got
> lucky - and/or that demonstrates how hard xfs tries to keep your data
> sane.

I don't think it's XFS related at all....

> > > This had never been an issue with v5.15. Is it possible the
> > > upgrade to v6.1 had a hand in this, or is that probably just
> > > coincidence?
> >
> > It could be a dm-writecache or md raid regression, but it could be
> > just "luck".
>
> Ugh. Sigh. I guess at some point I'm going to have to bite the bullet
> again, and next time watch the cache device like a hawk. I'll keep an
> eye out for dm-writecache and md raid problems and patches etc. to see
> what might come up.
>
> Is there anything else that occurs to you that I might monitor prior
> to and during any future recurrence of the problem?

Not really. Whatever problem occurred to start things going bad, it was
not evident from the information that was gathered.

> > > In particular, could "5e672cd69f0a xfs: non-blocking inodegc
> > > pushes" cause a significantly greater write load on the cache?
> >
> > No.
> >
> > > I note that there's one fs (separate to the corrupt one above)
> > > that's still in mount recovery since the boot some hours ago. On
> > > past experience that indicates the inodegc stuff is holding things
> > > up, i.e. it would have been running prior to the reboot - or at
> > > least trying to.
> >
> > I only found one inodegc worker running in the trace above - it was
> > blocked on writecache doing a rmapbt block read removing extents. It
> > may have been a long running cleanup, but it may not have been.
>
> Probably was a cleanup: that's the reason for the update to v6.1 -
> every now and again the box was running into the problem of getting
> blocked on the cleanup. The extent of the problem is significantly
> reduced by moving from one large fs, where any extended cleanup would
> block everything, to multiple small-ish fs'es (500G-15T), where the
> blast radius of an extended cleanup is far more constrained. But the
> problem was still pretty annoying when it hit.
>
> Hmmm, maybe I can just carry a local backport of "non-blocking inodegc
> pushes" in my local v5.15. That would push back my need to move to
> v6.1.
>
> Or could / should it be considered for an official backport? Looks
> like it applies cleanly to current v5.15.120.

I thought that had already been done - there's supposed to be someone
taking care of 5.15 LTS backports for XFS....
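
If you do end up carrying it locally in the meantime, it's just a
cherry-pick onto the stable tag. A rough sketch of what I mean - it
assumes a clone of the linux-stable tree with the mainline commit
available (add Linus's tree as a remote and fetch it if not), and the
branch name and build targets are just examples:

$ git checkout -b xfs-inodegc-backport v5.15.120
$ git cherry-pick -x 5e672cd69f0a   # xfs: non-blocking inodegc pushes
$ make olddefconfig && make -j"$(nproc)"

The -x records the upstream commit ID in the commit message, which
makes the local patch easy to drop once it lands in 5.15.y properly.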

> > As it is, mount taking a long time because there's an inode with a
> > huge number of extents that need freeing on an unlinked list is
> > *not* related to background inodegc in any way - this has -always-
> > happened with XFS; it's simply what unlinked inode recovery does.
> >
> > i.e. background inodegc just moved the extent freeing from syscall
> > exit context to an async worker thread; the filesystem still has
> > exactly the same work to do. If the system goes down while that work
> > is in progress, log recovery has always finished off that cleanup
> > work...
>
> Gotcha. I'd mistakenly thought "non-blocking inodegc pushes" queued up
> the garbage collection for background processing. My further mistaken
> hand-wavy thought was that, if the processing that was previously
> foreground was now punted to background (perhaps with different
> priorities), maybe the background processing was simply way more
> efficient - enough to swamp the cache with metadata updates.

I doubt it - the processing is exactly the same code, just done from a
different task context.

> But with your further explanation, and actually reading the patch
> (should have done that first), I see the gc was already queued; the
> update was to NOT wait for the queue to be flushed.
>
> Hmmm, then again... might it be possible that, without the patch, at
> some point after a large delete, further work was blocked whilst
> waiting for the queue to be flushed, limiting the total amount of
> work, but with the patch, the further work (e.g. more deletes) is able
> to be queued - possibly to the point of swamping the cache device?

Unlinks don't generate a lot of IO. They do a lot of repeated
operations on cached metadata, and the journal relogs all those blocks
without triggering IO to the journal, too.

> > > There's been no change to the cache device over the reboot, and it
> > > currently looks busy, but it doesn't look completely swamped:
> > ....
> >
> > About 150 1MB sized reads, and about 1000-1500 much smaller
> > writes each second, with an average write wait of near 10ms.
> > Certainly not a fast device, and it's running at about 50%
> > utilisation with an average queue depth of 10-15 IOs.
>
> That's hopefully the cache working as intended: sinking small
> continuous writes (network data uploads) and aggregating them into
> larger blocks to flush out to the bulk storage (i.e. reads from the
> cache to write to the bulk).
>
> > > It should be handling about the same load as prior to the reboot.
> >
> > If that's the typical sustained load, I wouldn't expect it to have
> > that much extra overhead, given small writes on RAID 6 volumes have
> > the worst performance characteristics possible. If the write cache
> > starts flushing lots of small discontiguous writes, I'd expect to
> > see that device go to 100% utilisation and long write wait times for
> > extended periods...
>
> There shouldn't be many small discontiguous writes in the data: it's
> basically network uploads to sequential files in the 100s MB to multi
> GB range. But there's also a bunch of reflinking, not to mention
> occasionally removing highly reflinked multi-TB files, so these
> metadata updates might count as "lots of small discontiguous writes"?

Yeah, the metadata updates end up being small discontiguous writes....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx