Re: XFS corruption after power surge/outage

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 13 Feb 2024 08:06:22 +1100

On Mon, Feb 12, 2024 at 10:07:33AM -0800, Jorge Garcia wrote:
> On Sun, Feb 11, 2024 at 12:39 PM Eric Sandeen <sandeen@xxxxxxxxxxx> wrote:
> 
> > I was going to suggest creating an xfs_metadump image for analysis.
> > Was that created with xfsprogs v6.5.0 as well?
> 
> > so the metadump did not complete?
> 
> I actually tried running xfs_metadump with both v5.0 and v6.5.0. They
> both gave many error messages, but they created files. Not sure what I
> can do with those files

Nothing - they are incomplete, as metadump aborted when it hit
that error.

> > Does the filesystem mount? Can you mount it -o ro or -o ro,norecovery
> > to see how much you can read off of it?
> 
> The file system doesn't mount. The message when I try to mount it is:
> 
> mount: /data: wrong fs type, bad option, bad superblock on /dev/sda1,
> missing codepage or helper program, or other error.
> 
> and
> 
> Feb 12 10:06:02 hgdownload1 kernel: XFS (sda1): Superblock has unknown
> incompatible features (0x10) enabled.
> Feb 12 10:06:02 hgdownload1 kernel: XFS (sda1): Filesystem cannot be
> safely mounted by this kernel.
> Feb 12 10:06:02 hgdownload1 kernel: XFS (sda1): SB validate failed
> with error -22.

That has the XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR bit set...

> I wonder if that is because I tried a xfs_repair with a newer version...

... which is a result of xfs_repair 6.5.0 crashing midway through
repair of the filesystem. Your kernel is too old to recognise the
NEEDSREPAIR bit. You can clear it with xfs_db like this:

Run this to get the current field value:

# xfs_db -c "sb 0" -c "p features_incompat" <dev>

Then subtract 0x10 from the value returned and run:

# xfs_db -x -c "sb 0" -c "write features_incompat <val>" <dev>

(The -x puts xfs_db into expert mode, which the write command needs.)
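
As a minimal sketch of the arithmetic for <val>: the snippet below clears
the 0x10 bit with a mask rather than a literal subtraction. The two give
the same result when the bit is set, and the mask leaves the value alone
if it is not. The input 0x1b is a made-up example, not the value from this
filesystem, and clear_needsrepair is a hypothetical helper name.

```python
# 0x10 is the incompat feature bit the kernel log complains about
# (XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR).
NEEDSREPAIR = 0x10


def clear_needsrepair(features_incompat: int) -> int:
    """Return features_incompat with the NEEDSREPAIR bit cleared."""
    return features_incompat & ~NEEDSREPAIR


# Hypothetical value as read back from "p features_incompat".
current = int("0x1b", 16)
print(hex(clear_needsrepair(current)))  # prints 0xb
```

The printed hex value is what would go in place of <val> in the write
command.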

But that won't get you too far - the filesystem is still corrupt and
inconsistent. By blowing away the log with xfs_repair before
actually determining if the problem was caused by a RAID array
issue, you've essentially forced yourself into a filesystem recovery
situation.

> > If mount fails, what is in the kernel log when it fails?
> 
> > Power losses really should not cause corruption, it's a metadata journaling
> > filesystem which should maintain consistency even with a power loss.
> >
> > What kind of storage do you have, though? Corruption after a power loss often
> > stems from a filesystem on a RAID with a write cache that does not honor
> > data integrity commands and/or does not have its own battery backup.
> 
> We have a RAID 6 card with a BBU:
> 
> Product Name    : AVAGO MegaRAID SAS 9361-8i
> Serial No       : SK00485396
> FW Package Build: 24.21.0-0017

Ok, so they don't actually have a BBU on board - it's an option to
add via a module, but the basic RAID controller doesn't have any
power failure protection. These cards are also pretty old tech now -
how old is this card, and when was the last time the cache
protection module was tested?

Indeed, how long was the power out for?

The BBU on most RAID controllers is only guaranteed to hold the
state for 72 hours (when new) and I've personally seen them last for
only a few minutes before dying when the RAID controller had been in
continuous service for ~5 years. So the duration of the power
failure may be important here.

Also, how are the back end disks configured? Do they have their
volatile write caches turned off? What cache mode was the RAID
controller operating in - write-back or write-through?

What's the rest of your storage stack? Do you have MD, LVM, etc
between the storage hardware and the filesystem?

> I agree that power issues should not cause corruption, but here we
> are.

Yup. Keep in mind that we do occasionally see these old LSI
hardware raid cards corrupt storage on power failure, so we're not
necessarily even looking for filesystem problems at this point in
time. We need to rule that out first before doing any more damage to
the filesystem than you've already done trying to recover it so
far...

> Somewhere on one of the discussion threads I saw somebody mention
> ufsexplorer, and when I downloaded the trial version, it seemed to see
> most of the files on the device. I guess if I can't find a way to
> recover the current filesystem, I will try to use that to recover the
> data.

Well, that's a last resort. But if your raid controller is unhealthy
or the volume has been corrupted by the raid controller, then
ufsexplorer won't help you get your data back, either...

Cheers,

Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx