Re: Question regarding XFS crisis recovery

In principle that should have worked. And yes, once you’ve got the filesystem back to the point where it mounts, xfs_repair is your only option.
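
For the record, the clone step being described would presumably have been something along these lines (device names here are just placeholders; keeping the block size small limits how much gets zero-filled around each unreadable sector):

    # clone the failing member onto a fresh drive, zero-filling unreadable blocks
    dd if=/dev/sdX of=/dev/sdY bs=4096 conv=notrunc,noerror,sync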

It might have been useful to take an xfs_metadump before the repair, both to see what xfs_repair would make of it and to share it with others for their thoughts.
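
Roughly something like this (the device and dump paths are placeholders, and the dump needs to go to a separate, healthy filesystem):

    xfs_metadump -g /dev/md0 /elsewhere/broken.metadump    # metadata only; -g shows progress
    xfs_mdrestore /elsewhere/broken.metadump /elsewhere/broken.img
    xfs_repair -n /elsewhere/broken.img                    # dry-run the repair against the image

The dump contains no file data and obfuscates names by default, so it’s reasonably safe to pass around for a second opinion.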

It does seem like there should be an md resync recovery option which substitutes zeroes for bad blocks instead of giving up immediately. A few blocks of corrupted data in 150 TB is obviously preferable to no data at all.

Or allow it to fall back to reading the ‘dropped out’ drives if there’s a read error elsewhere in the stripe while they’re being rebuilt.

—
Roger


> On 15 Nov 2021, at 17:14, Sean Caron <scaron@xxxxxxxxx> wrote:
> 
> Hi all,
> 
> I recently had to manage a storage failure on a ~150 TB XFS volume and
> I just wanted to check with the group here to see if anything could
> have been done differently. Here is my story.
> 
> We had a 150 TB RAID 60 volume formatted with XFS. The volume was made
> up of two 21-drive RAID 6 strings (4 TB drives). This was all done
> with Linux MD software RAID.
> 
> The filesystem was filled to 100% capacity when it failed. I'm not
> sure if this contributed to the poor outcome.
> 
> There was no backup available of this filesystem (of course).
> 
> About a week ago, we had two drives become spuriously ejected from one
> of the two RAID 6 strings that composed this volume. This seems to
> happen sometimes as a result of various hardware and software
> glitches. We checked the drives with smartctl, added them back to the
> array and a resync operation started.
> 
> The resync ran for a little while and failed, because a third disk in
> the array (which mdadm had never failed out, and smartctl still
> thought was OK) reported a read error/bad blocks and dropped out of
> the array.
> 
> We decided to clone the failed disk to a brand new replacement drive with:
> 
> dd conv=notrunc,noerror,sync
> 
> Figuring we'd lose a few sectors to get nulled out, but we'd have a
> drive that could run the rebuild without getting kicked due to read
> errors (we've used this technique in the past to recover from this
> kind of situation successfully).
> 
> Clone completed. We swapped the clone drive with the bad blocks drive
> and kicked off another rebuild.
> 
> Rebuild fails again because a fourth drive is throwing bad blocks/read
> errors and gets kicked out of the array.
> 
> We scan all 21 drives in this array with smartctl and there are
> actually three more drives in total where SMART has logged read
> errors.
> 
> This is starting to look pretty bad but what can we do? We just clone
> these three drives to three more fresh drives using dd
> conv=notrunc,noerror,sync.
> 
> Swap them in for the old bad block drives and kick off another
> rebuild. The rebuild actually runs and completes successfully. MD
> thinks the array is fine, running, not degraded at all.
> 
> We mount the array. It mounts, but it is obviously pretty damaged.
> Normally when this happens we try to mount it read only and copy off
> what we can, then write it off. This time, we can hardly do anything
> but an "ls" in the filesystem without getting "structure needs
> cleaning". Doing any kind of material access to the filesystem gives
> various major errors (i.e. "in-memory corruption of filesystem data
> detected") and the filesystem goes offline. Reads just fail with I/O
> errors.
> 
> What can we do? Seems like at this stage we just run xfs_repair and
> hope for the best, right?
> 
> Ran xfs_repair in dry run mode and it's looking pretty bad, just from
> the sheer amount of output.
> 
> But there's no real way to know exactly how much data xfs_repair will
> wipe out, and what alternatives do we have? The filesystem hardly
> mounts without faulting anyway. Seems like there's little choice but
> to go forward, run it, and see what shakes out.
> 
> We run xfs_repair overnight. It ran for a while, then eventually hung
> in Phase 4, I think.
> 
> We killed xfs_repair off and re-ran it with the -P flag. It runs for
> maybe two or three hours and eventually completes.
> 
> We mount the filesystem up. Of around 150 TB, we have maybe 10% of
> that in data salad in lost+found, 21 GB of good data and the rest is
> gone.
> 
> Copy off what we can, and call it dead. This is where we're at now.
> 
> It seems like the MD rebuild process really scrambled things somehow.
> I'm not sure if this was due to some kind of kernel bug, or just
> zeroed out bad sectors in wrong places or what. Once the md resync
> ran, we were cooked.
> 
> I guess, after blowing through four or five "Hope you have a backup,
> but if not, you can try this and pray" checkpoints, I just want to
> check with the developers and group here to see if we did the best
> thing possible given the circumstances?
> 
> Xfs_repair is it, right? When things are that scrambled, pretty much
> all you can do is run an xfs_repair and hope for the best? Am I
> correct in thinking that there is no better or alternative tool that
> will give different results?
> 
> Can a commercial data recovery service make any better sense of a
> scrambled XFS than xfs_repair could? When the underlying device is
> presenting OK, just scrambled data on it?
> 
> Thanks,
> 
> Sean
>