Re: XFS disaster recovery

On Tue, Feb 01, 2022 at 06:07:18PM -0500, Sean Caron wrote:
> Hi all,
> 
> Me again with another not-backed-up XFS filesystem that's in a little
> trouble. Last time I stopped by to discuss my woes, I was told that I
> could check in here and get some help reading the tea leaves before I
> do anything drastic so I'm doing that :)
> 
> Brief backstory: This is a RAID 60 composed of three 18-drive RAID 6
> strings of 8 TB disk drives, around 460 TB total capacity. Last week
> we had a disk fail out of the array. We replaced the disk and the
> recovery hung at around 70%.
> 
> We power cycled the machine and enclosure and got the recovery to run
> to completion. Just as it finished up, the same string dropped another
> drive.
> 
> We replaced that drive and started recovery again. It got a fair bit
> into the recovery, then hung just as did the first drive recovery, at
> around +/- 70%. We power cycled everything again, then started the
> recovery. As the recovery was running again, a third disk started to
> throw read errors.
> 
> At this point, I decided to just stop trying to recover this array so
> it's up with two disks down but otherwise assembled. I figured I would
> just try to mount ro,norecovery and try to salvage as much as possible
> at this point before going any further.
> 
> Trying to mount ro,norecovery, I am getting an error:

Seeing as you've only lost redundancy at this point in time,
mounting with ro,norecovery will simply result in mounting the
filesystem in an inconsistent state, and so you'll see metadata
corruption errors because the log has not been replayed.

> metadata I/O error in "xfs_trans_read_buf_map at daddr ... len 8 error 74
> Metadata CRC error detected at xfs_agf_read_verify+0xd0/0xf0 [xfs],
> xfs_agf block ...
> 
> I ran an xfs_repair -L -n just to see what it would spit out. It
> completes within 15-20 minutes (which I feel might be a good sign,
> from my experience, outcomes are inversely proportional to run time),
> but the output is implying that it would unlink over 100,000 files
> (I'm not sure how many total files are on the filesystem, in terms of
> what proportion of loss this would equate to) and it also says:
> 
> "Inode allocation btrees are too corrupted, skipping phases 6 and 7"

This is expected because 'xfs_repair -n' does not recover the log.
Hence you're running checks on an inconsistent fs and repair is
detecting that the inobts are inconsistent so it can't check the
directory structure connectivity and link counts sanely.

What you want to do here is take a metadump of the filesystem (it's
an offline operation) and restore it to an image file on a
different system (the restore creates a sparse file, so it just
needs to run on a fs that supports file sizes > 16TB). You can then
mount the image file via "mount -o loop <fs.img> <mntpt>", which
will run log recovery on the image. Then you can unmount it again
and see if the resultant filesystem image contains any corruption
via 'xfs_repair -n'.
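
Roughly, the sequence looks something like this - the device and
file names below are just placeholders, substitute your own:

  # on the original machine, with the fs unmounted (offline operation)
  xfs_metadump /dev/<md-device> /path/to/fs.metadump

  # on a different system, restore the dump to a sparse image file
  xfs_mdrestore /path/to/fs.metadump /path/to/fs.img

  # mount the image so log recovery runs, then unmount it
  mount -o loop /path/to/fs.img /mnt/scratch
  umount /mnt/scratch

  # now check the recovered image for corruption
  xfs_repair -n /path/to/fs.img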

If no problems are found, then the original filesystem is all
good and all you need to do is mount it; everything should be
there ready for the migration to non-failing storage.

If there are warnings/repairs needed, then you're probably best to
post the output of 'xfs_repair -n' so we can review it and determine
the best course of action from there.

IOWs, do all the diagnosis/triage of the filesystem state on the
restored metadump image so that we don't risk further damaging the
real storage. If we screw up a restored filesystem image, it's no
big deal - we can just return it to the original state by restoring
it from the metadump again and try something different.
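
i.e. if an experiment on the image goes wrong, just throw the image
away and restore it again from the same dump (paths as per the
placeholders above):

  rm /path/to/fs.img
  xfs_mdrestore /path/to/fs.metadump /path/to/fs.img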

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx


