Re: XFS corruption after power surge/outage

On Mon, Feb 12, 2024 at 1:06 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:

> Ok, so they don't actually have a BBU on board - it's an option to
> add via a module, but the basic RAID controller doesn't have any
> power failure protection. These cards are also pretty old tech now -
> how old is this card, and when was the last time the cache
> protection module was tested?

The card reports a Mfg. Date of 01/26/20, so it is not too old. When
was the cache protection last tested? No idea.
The BBU status for the card reports the battery state as optimal.

> Indeed, how long was the power out for?

I'm not sure exactly how long power was out, but it was less than an
hour, and probably more like a few minutes. The data center is
supposed to have UPS power and generator power, but a breaker tripped
and we lost power.

>
> The BBU on most RAID controllers is only guaranteed to hold the
> state for 72 hours (when new) and I've personally seen them last for
> only a few minutes before dying when the RAID controller had been in
> continuous service for ~5 years. So the duration of the power
> failure may be important here.
>
> Also, how are the back end disks configured? Do they have their
> volatile write caches turned off? What cache mode was the RAID
> controller operating in - write-back or write-through?
>
> What's the rest of your storage stack? Do you have MD, LVM, etc
> between the storage hardware and the filesystem?

You may be asking questions I'm not sure how to answer. Most of the
settings are defaults. The RAID controller was operating in WB
(write-back) mode. There is no MD or LVM, just 24 disks in a MegaRAID
RAID-6 configuration, presented to the OS as one device, which was
formatted as XFS.
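
On the write cache question, one thing that can be checked from the OS
side is the cache mode the kernel believes each block device is in, by
reading the queue/write_cache attribute under /sys/block. Below is a
minimal sketch, assuming a Linux kernel recent enough to expose that
attribute; the function name is made up for illustration. It shows only
the kernel's view of the single device the controller presents, and
says nothing about the volatile caches on the individual disks behind
the controller, which have to be queried through the controller's own
management tool.

```python
from pathlib import Path


def read_write_cache_modes(sys_block="/sys/block"):
    """Return {device_name: mode} for devices exposing queue/write_cache.

    The kernel reports the mode as the string "write back" or
    "write through". Devices without the attribute are skipped.
    """
    modes = {}
    for dev in sorted(Path(sys_block).iterdir()):
        attr = dev / "queue" / "write_cache"
        if attr.is_file():
            modes[dev.name] = attr.read_text().strip()
    return modes


if __name__ == "__main__":
    # Print one line per block device, e.g. "<device>: write back"
    for name, mode in read_write_cache_modes().items():
        print(f"{name}: {mode}")
```

A device shown as "write back" is one the kernel treats as having a
volatile cache, so it sends cache flushes to it; whether the controller
and the disks behind it honour those flushes is a separate question.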

> > Somewhere on one of the discussion threads I saw somebody mention
> > ufsexplorer, and when I downloaded the trial version, it seemed to see
> > most of the files on the device. I guess if I can't find a way to
> > recover the current filesystem, I will try to use that to recover the
> > data.
>
> Well, that's a last resort. But if your raid controller is unhealthy
> or the volume has been corrupted by the raid controller the
> ufsexplorer won't help you get your data back, either....

The controller is reporting everything as working, all disks are
Online and Spun Up, and no errors are reported as far as I can tell. I
did get ufsexplorer, and it seems to be recovering data, but it will
take days or weeks to recover all of it. I would still like to know
more about what happened, how to prevent it from happening in the
future, and what the correct sequence of steps would have been when
encountering a problem like this.

Thanks for all your help!




