Question regarding XFS crisis recovery

Hi all,

I recently had to manage a storage failure on a ~150 TB XFS volume and
I just wanted to check with the group here to see if anything could
have been done differently. Here is my story.

We had a 150 TB RAID 60 volume formatted with XFS. The volume was made
up of two 21-drive RAID 6 strings (4 TB drives). This was all done
with Linux MD software RAID.

The filesystem was filled to 100% capacity when it failed. I'm not
sure if this contributed to the poor outcome.

There was no backup available of this filesystem (of course).

About a week ago, we had two drives become spuriously ejected from one
of the two RAID 6 strings that composed this volume. This seems to
happen sometimes as a result of various hardware and software
glitches. We checked the drives with smartctl, added them back to the
array, and a resync operation started.
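For reference, that step was something like this (/dev/md0 and the
sdX/sdY names are placeholders, not the real devices):

smartctl -a /dev/sdX   # SMART health and attributes looked clean
smartctl -a /dev/sdY
mdadm --manage /dev/md0 --re-add /dev/sdX   # or --add; put the ejected drives back
mdadm --manage /dev/md0 --re-add /dev/sdY
cat /proc/mdstat   # watch the resync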

The resync ran for a little while and failed, because a third disk in
the array (which mdadm had never failed out, and smartctl still
thought was OK) reported a read error/bad blocks and dropped out of
the array.

We decided to clone the failed disk to a brand new replacement drive with:

dd conv=notrunc,noerror,sync

The thinking was that we'd lose a few sectors (nulled out in the copy),
but we'd end up with a drive that could survive the rebuild without
getting kicked for read errors (we've used this technique successfully
in the past to recover from this kind of situation).
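Spelled out, the clone command was along these lines (sdX = the failing
source disk, sdY = the fresh target; the names and block size here are
just illustrative):

dd if=/dev/sdX of=/dev/sdY bs=4096 conv=notrunc,noerror,sync
# noerror keeps dd going past read failures; sync pads each unreadable
# block with zeros, so with a small bs only a sector-sized chunk gets
# nulled per error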

The clone completed. We swapped the clone in for the bad-blocks drive
and kicked off another rebuild.
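For anyone reconstructing the sequence: with the clone in place,
getting the array back up and rebuilding is essentially a stop plus
forced re-assemble, something like (array and member names are
placeholders):

mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sd[b-v]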

Rebuild fails again because a fourth drive is throwing bad blocks/read
errors and gets kicked out of the array.

We scan all 21 drives in this array with smartctl and there are
actually three more drives in total where SMART has logged read
errors.
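That scan is just a loop over the member disks, something like (the
device range is illustrative):

for d in /dev/sd[b-v]; do
    echo "== $d =="
    smartctl -H -l error "$d"   # health verdict plus the logged read errors
done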

This is starting to look pretty bad, but what can we do? We clone
these three drives to three more fresh drives using the same dd
conv=notrunc,noerror,sync approach as before.

We swap them in for the old bad-block drives and kick off another
rebuild. This time the rebuild actually runs and completes
successfully. MD thinks the array is fine: running, not degraded at
all.
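"Fine" here meaning that something like the following reported a
clean, non-degraded state with all 21 members active (array name is a
placeholder):

mdadm --detail /dev/md0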

We mount the array. It mounts, but it is obviously pretty damaged.
Normally when this happens we try to mount it read-only, copy off what
we can, and then write it off. This time, we can hardly do anything
beyond an "ls" in the filesystem without getting "structure needs
cleaning". Any kind of substantial access to the filesystem produces
various major errors (e.g. "in-memory corruption of filesystem data
detected") and the filesystem shuts down. Reads just fail with I/O
errors.
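For completeness, the read-only mount we normally try is along these
lines (the mount point is a placeholder; the norecovery variant is the
one to reach for if log replay itself is what faults):

mount -o ro /dev/md0 /mnt/recovery
mount -o ro,norecovery /dev/md0 /mnt/recovery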

What can we do? Seems like at this stage we just run xfs_repair and
hope for the best, right?

We ran xfs_repair in dry-run mode, and it looked pretty bad just from
the sheer amount of output.
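Concretely, something like:

xfs_repair -n /dev/md0 2>&1 | tee xfs_repair_dryrun.log
# -n is no-modify mode: scan and report what would be repaired
# without writing anything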

But there's no real way to know in advance exactly how much data
xfs_repair will wipe out, and what alternative do we have? The
filesystem hardly mounts without faulting anyway. It seems like
there's little choice but to run it and see what shakes out.

We ran xfs_repair overnight. It ran for a while, then eventually hung,
in Phase 4 I think.

We killed xfs_repair off and re-ran it with the -P flag. That run took
maybe two or three hours and eventually completed.
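That is, something like:

xfs_repair -P /dev/md0
# -P disables inode/directory block prefetching, the usual thing to
# try when a repair run appears to hang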

We mounted the filesystem back up. Of the roughly 150 TB, maybe 10%
ended up as data salad in lost+found, about 21 GB was good data, and
the rest is gone.

Copy off what we can, and call it dead. This is where we're at now.

It seems like the MD rebuild process really scrambled things somehow.
I'm not sure whether that was due to some kind of kernel bug, or to
the zeroed-out bad sectors from the clones being rebuilt into the
wrong places, or what. Once the md resync ran, we were cooked.

I guess, after blowing through four or five "hope you have a backup,
but if not, you can try this and pray" checkpoints, I just want to
check with the developers and the group here to see whether we did the
best thing possible given the circumstances.

xfs_repair is it, right? When things are that scrambled, is running
xfs_repair and hoping for the best pretty much all you can do? Am I
correct in thinking that there is no better or alternative tool that
would give different results?

Can a commercial data recovery service make any better sense of a
scrambled XFS filesystem than xfs_repair can, when the underlying
device presents fine and it's just the data on it that's scrambled?

Thanks,

Sean


