Re: XFS corruption after power surge/outage

Eric Sandeen <sandeen@xxxxxxxxxxx> · Sun, 11 Feb 2024 14:39:04 -0600

On 2/9/24 12:39 PM, Jorge Garcia wrote:
> Hello,
> 
> We have a server with a very large (300+ TB) XFS filesystem that we
> use to provide downloads to the world. Last week's storms in
> California caused damage to our machine room, causing unexpected power
> surges and power outages, even in our UPS and generator backed data
> center. One of the end results was some data corruption on our server
> (running Centos 8). After looking around the internet for solutions to
> our issues, the general consensus seemed to be to run xfs_repair on
> the filesystem to get it to recover. We tried that (xfs_repair V 5.0)
> and it seemed to report lots of issues before eventually failing
> during "Phase 6" with an error like:
> 
>   Metadata corruption detected at 0x46d6c4, inode 0x8700657ff8 dinode
> 
>   fatal error -- couldn't map inode 579827236856, err = 117
> 
> After another set of internet searches, we found some postings that
> suggested this could be a bug that may have been fixed in later
> versions, so we built xfs_repair V 6.5 and tried the repair again. The
> results were the same. We even tried "xfs_repair -L", and no joy. So
> now we're desperate. Is the data all lost? We can't mount the
> filesystem. We tried using xfs_metadump (another suggestion from our
> searches) and it reports lots of metadata corruption ending with:

I was going to suggest creating an xfs_metadump image for analysis.
Was that created with xfsprogs v6.5.0 as well?

> Metadata corruption detected at 0x4382f0, xfs_cntbt block 0x1300023518/0x1000
> Metadata corruption detected at 0x4382f0, xfs_cntbt block 0x1300296bf8/0x1000
> Metadata corruption detected at 0x4382f0, xfs_bnobt block 0x137fffb258/0x1000
> Metadata corruption detected at 0x4382f0, xfs_bnobt block 0x138009ebd8/0x1000
> Metadata corruption detected at 0x467858, xfs_inobt block 0x138067f550/0x1000
> Metadata corruption detected at 0x467858, xfs_inobt block 0x13834b39e0/0x1000
> xfs_metadump: bad starting inode offset 5

So the metadump did not complete?

Does the filesystem mount? Can you mount it -o ro or -o ro,norecovery
to see how much you can read off of it?

If mount fails, what is in the kernel log when it fails?
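
For reference, that sequence could be sketched roughly as below. This is
only an illustration: the function name, the DRY_RUN switch, and the
device and mount point passed to it are placeholders of mine, not
anything from your setup. By default it only prints the commands it
would run.

```shell
# Try progressively less intrusive read-only mounts, then show the kernel log.
# Usage: try_ro_mount DEVICE MOUNTPOINT   (both arguments are placeholders)
# DRY_RUN defaults to 1 (print only); set DRY_RUN=0 to actually run as root.
try_ro_mount() {
    dev=$1
    mnt=$2
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "mount -t xfs -o ro $dev $mnt"
        echo "mount -t xfs -o ro,norecovery $dev $mnt"
        echo "dmesg | tail -n 50"
        return 0
    fi
    # A plain read-only mount still replays the log.
    mount -t xfs -o ro "$dev" "$mnt" && return 0
    # norecovery skips log replay entirely; it is only valid together with ro.
    mount -t xfs -o ro,norecovery "$dev" "$mnt" && return 0
    # Both attempts failed: show the most recent kernel messages.
    dmesg | tail -n 50
    return 1
}
```

With norecovery the log is not replayed, so what you see may be
inconsistent; treat it as a way to copy data off, not as a healthy
filesystem.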

> Not sure what to try next. Any help would be greatly appreciated. Thanks!

Power losses really should not cause corruption; XFS is a metadata journaling
filesystem, which should maintain consistency even across a power loss.

What kind of storage do you have, though? Corruption after a power loss often
stems from a filesystem on a RAID with a write cache that does not honor
data integrity commands and/or does not have its own battery backup.
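
As a first, partial check, something like the sketch below lists what
the kernel believes about each block device's write cache, read from the
queue/write_cache attribute in sysfs. The function name and the optional
root argument are mine, for illustration. Note the limits: this is only
the kernel's view. It cannot tell you whether a hardware RAID controller
really honors flushes or has a working battery; that needs the
controller's own tools.

```shell
# Print "<device>: <write back|write through>" for each block device.
# The optional argument overrides the sysfs root (default: /sys/block).
report_write_cache() {
    root="${1:-/sys/block}"
    for f in "$root"/*/queue/write_cache; do
        # Skip devices without the attribute (or an unmatched glob).
        [ -r "$f" ] || continue
        dev=${f#"$root"/}   # strip the root prefix
        dev=${dev%%/*}      # keep only the device name
        printf '%s: %s\n' "$dev" "$(cat "$f")"
    done
}
```

"write back" means the kernel treats the device as having a volatile
cache and will send flushes to it; "write through" means it assumes
writes are stable once acknowledged.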

-Eric

> Jorge
>