On Mon, Feb 12, 2024 at 10:07:33AM -0800, Jorge Garcia wrote:
> On Sun, Feb 11, 2024 at 12:39 PM Eric Sandeen <sandeen@xxxxxxxxxxx> wrote:
> > I was going to suggest creating an xfs_metadump image for analysis.
> > Was that created with xfsprogs v6.5.0 as well?
> >
> > so the metadump did not complete?
>
> I actually tried running xfs_metadump with both v5.0 and v6.5.0. They
> both gave many error messages, but they created files. Not sure what I
> can do with those files

Nothing - they are incomplete, as metadump aborted when it got that
error.

> > Does the filesystem mount? Can you mount it -o ro or -o ro,norecovery
> > to see how much you can read off of it?
>
> The filesystem doesn't mount. The message when I try to mount it is:
>
> mount: /data: wrong fs type, bad option, bad superblock on /dev/sda1,
> missing codepage or helper program, or other error.
>
> and
>
> Feb 12 10:06:02 hgdownload1 kernel: XFS (sda1): Superblock has unknown
> incompatible features (0x10) enabled.
> Feb 12 10:06:02 hgdownload1 kernel: XFS (sda1): Filesystem cannot be
> safely mounted by this kernel.
> Feb 12 10:06:02 hgdownload1 kernel: XFS (sda1): SB validate failed
> with error -22.

That has the XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR bit set...

> I wonder if that is because I tried an xfs_repair with a newer version...

.... which is a result of xfs_repair 6.5.0 crashing midway through
repair of the filesystem. Your kernel is too old to recognise the
NEEDSREPAIR bit. You can clear it with xfs_db like this:

Run this to get the current field value:

# xfs_db -c "sb 0" -c "p features_incompat" <dev>

Then subtract 0x10 from the value returned and run (note that write
requires xfs_db's expert mode, -x):

# xfs_db -x -c "sb 0" -c "write features_incompat <val>" <dev>

(A worked example with a hypothetical value is at the end of this mail.)

But that won't get you very far - the filesystem is still corrupt and
inconsistent. By blowing away the log with xfs_repair before actually
determining whether the problem was caused by a RAID array issue,
you've essentially forced yourself into a filesystem recovery
situation.

> > If mount fails, what is in the kernel log when it fails?
> >
> > Power losses really should not cause corruption; it's a metadata
> > journaling filesystem which should maintain consistency even with a
> > power loss.
> >
> > What kind of storage do you have, though? Corruption after a power
> > loss often stems from a filesystem on a RAID with a write cache that
> > does not honor data integrity commands and/or does not have its own
> > battery backup.
>
> We have a RAID 6 card with a BBU:
>
> Product Name : AVAGO MegaRAID SAS 9361-8i
> Serial No : SK00485396
> FW Package Build: 24.21.0-0017

Ok, so that card doesn't actually have a BBU on board - it's an option
to add via a module, but the basic RAID controller doesn't have any
power failure protection. These cards are also pretty old tech now -
how old is this card, and when was the last time the cache protection
module was tested?

Indeed, how long was the power out for? The BBU on most RAID
controllers is only guaranteed to hold state for 72 hours (when new),
and I've personally seen them last for only a few minutes before dying
when the RAID controller had been in continuous service for ~5 years.
So the duration of the power failure may be important here.

Also, how are the back end disks configured? Do they have their
volatile write caches turned off? What cache mode was the RAID
controller operating in - write-back or write-through? What's the rest
of your storage stack? Do you have MD, LVM, etc. between the storage
hardware and the filesystem? (Some commands for checking these are
sketched below.)
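To make the xfs_db steps above concrete, here's a sketch with a made-up
features_incompat value - your filesystem will report something
different, so substitute whatever the print command shows, and treat
the output lines as indicative rather than exact:

# xfs_db -c "sb 0" -c "p features_incompat" /dev/sda1
features_incompat = 0x13

0x13 - 0x10 = 0x3, so write that back:

# xfs_db -x -c "sb 0" -c "write features_incompat 0x3" /dev/sda1
features_incompat = 0x3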
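For the storage stack and cache questions, something like the following
should show the relevant state. The MegaCli invocations are from memory
and assume the LSI/Broadcom CLI tools are installed (storcli has
equivalents), and smartctl's --get=wcache needs a reasonably recent
smartmontools - so treat these as a starting point, not gospel:

# lsblk                            <- block device stack (MD/LVM/partitions)
# cat /proc/mdstat                 <- any MD RAID arrays
# pvs; lvs                         <- any LVM volumes
# MegaCli64 -LDInfo -Lall -aALL    <- "Current Cache Policy" shows
                                      WriteBack vs WriteThrough
# MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL
                                   <- BBU/CacheVault state, if fitted
# smartctl -g wcache -d megaraid,0 /dev/sda
                                   <- per-disk volatile write cache
                                      (adjust the disk number per drive)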
> I agree that power issues should not cause corruption, but here we
> are.

Yup. Keep in mind that we do occasionally see these old LSI hardware
RAID cards corrupt storage on power failure, so we're not necessarily
even looking for filesystem problems at this point in time. We need to
rule that out first, before doing any more damage to the filesystem
than you've already done trying to recover it...

> Somewhere on one of the discussion threads I saw somebody mention
> ufsexplorer, and when I downloaded the trial version, it seemed to see
> most of the files on the device. I guess if I can't find a way to
> recover the current filesystem, I will try to use that to recover the
> data.

Well, that's a last resort. But if your RAID controller is unhealthy or
the volume has been corrupted by the controller, then ufsexplorer won't
help you get your data back, either....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx