On Mon, Nov 15, 2021 at 12:22 PM Sean Caron <scaron@xxxxxxxxx> wrote:
>
> Hi all,
>
> I recently had to manage a storage failure on a ~150 TB XFS volume and I just wanted to check with the group here to see if anything could have been done differently. Here is my story.
>
> We had a 150 TB RAID 60 volume formatted with XFS. The volume was made up of two 21-drive RAID 6 strings (4 TB drives). This was all done with Linux MD software RAID.
>
> The filesystem was filled to 100% capacity when it failed. I'm not sure if this contributed to the poor outcome.
>
> There was no backup available of this filesystem (of course).
>
> About a week ago, we had two drives become spuriously ejected from one of the two RAID 6 strings that composed this volume. This seems to happen sometimes as a result of various hardware and software glitches. We checked the drives with smartctl, added them back to the array and a resync operation started.
>
> The resync ran for a little while and failed, because a third disk in the array (which mdadm had never failed out, and smartctl still thought was OK) reported a read error/bad blocks and dropped out of the array.
>
> We decided to clone the failed disk to a brand new replacement drive with:
>
> dd conv=notrunc,noerror,sync
>
> Figuring we'd lose a few sectors that would get nulled out, but we'd have a drive that could run the rebuild without getting kicked due to read errors (we've used this technique in the past to recover from this kind of situation successfully).
>
> The clone completed. We swapped the clone drive in for the bad-blocks drive and kicked off another rebuild.
>
> The rebuild fails again because a fourth drive is throwing bad blocks/read errors and gets kicked out of the array.
>
> We scan all 21 drives in this array with smartctl and there are actually three more drives in total where SMART has logged read errors.
>
> This is starting to look pretty bad, but what can we do? We just clone these three drives to three more fresh drives using dd conv=notrunc,noerror,sync.
>
> We swap them in for the old bad-block drives and kick off another rebuild. The rebuild actually runs and completes successfully. MD thinks the array is fine, running, not degraded at all.
>
> We mount the array. It mounts, but it is obviously pretty damaged. Normally when this happens we try to mount it read-only and copy off what we can, then write it off. This time, we can hardly do anything but an "ls" in the filesystem without getting "structure needs cleaning". Doing any kind of material access to the filesystem gives various major errors (e.g. "in-memory corruption of filesystem data detected") and the filesystem goes offline. Reads just fail with I/O errors.
>
> What can we do? It seems like at this stage we just run xfs_repair and hope for the best, right?
>
> We ran xfs_repair in dry-run mode and it's looking pretty bad, just from the sheer amount of output.
>
> But there's no real way to know exactly how much data xfs_repair will wipe out, and what alternatives do we have? The filesystem hardly mounts without faulting anyway. It seems like there's little choice going forward but to run it and see what shakes out.
>
> We ran xfs_repair overnight. It ran for a while, then eventually hung in Phase 4, I think.
>
> We killed xfs_repair off and re-ran it with the -P flag. It ran for maybe two or three hours and eventually completed.
>
> We mount the filesystem up.
> Of around 150 TB, we have maybe 10% of that as data salad in lost+found, 21 GB of good data, and the rest is gone.
>
> We copy off what we can, and call it dead. This is where we're at now.
>
> It seems like the MD rebuild process really scrambled things somehow. I'm not sure if this was due to some kind of kernel bug, or just zeroed-out bad sectors in the wrong places, or what. Once the md resync ran, we were cooked.
>
> I guess, after blowing through four or five "hope you have a backup, but if not, you can try this and pray" checkpoints, I just want to check with the developers and group here to see if we did the best thing possible given the circumstances.
>
> xfs_repair is it, right? When things are that scrambled, pretty much all you can do is run an xfs_repair and hope for the best? Am I correct in thinking that there is no better or alternative tool that will give different results?
>
> Can a commercial data recovery service make any better sense of a scrambled XFS than xfs_repair could, when the underlying device is presenting OK and it's just the data on it that is scrambled?

I'm going to let others address the XFS issues, if any. My take is that this is not at all XFS related, but a problem with lower layers in the storage stack.

What is the SCT ERC value for each of the drives? This value must be less than the kernel's SCSI command timer, which by default is 30 seconds.

It sounds to me like a common misconfiguration: the drive SCT ERC is not configured, and bad sectors accumulate over time because they are never fixed up as a result of the misconfiguration. Once a single stripe is lost that represents a critical amount of file system metadata, you lose the whole file system. It's a very high penalty for what is actually an avoidable problem, but avoiding it relies on esoteric knowledge, and the problem persists because of the resistance of downstream distros to changing kernel defaults (they don't understand most of the knobs) and upstream kernel development's reluctance to change defaults because of various downstream expectations based on them.

Those are generally valid positions, but in the specific case of large software RAID arrays, Linux has a bad reputation strictly because of crap defaults, where the common case is that SCT ERC is a higher value than the SCSI command timer. And this will *always* lead to data loss, eventually.

Check the *device* timeout with this command:

smartctl -l scterc /dev/sdX

Check the *kernel* timeout with this command:

cat /sys/block/sdX/device/timeout

If the drive doesn't support configurable SCT ERC, then you must increase the kernel's command timer to a ridiculous value like 180. Seriously, 180 seconds for a drive to decide whether a sector is unreadable is ridiculous, but the logic of a consumer drive is that there is no redundancy, so it should try as long and hard as possible before giving up, which is the exact opposite of what we want in a RAID array.

This guide is a bit stale, and I prefer to change either SCT ERC or the command timer with a udev rule, but the result is the same:

https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

--
Chris Murphy
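
For anyone wanting the concrete version of that fix, here is a minimal sketch (the device name and values are examples only, not taken from this thread): if the drive supports SCT ERC, cap its error recovery time well under the kernel's 30-second command timer, for example at 7 seconds; otherwise raise the kernel timer for that device instead.

  # Set SCT ERC to 7 seconds for reads and writes (values are in tenths of a second)
  smartctl -l scterc,70,70 /dev/sdX

  # If the drive does not accept SCT ERC, raise the kernel command timer instead
  echo 180 > /sys/block/sdX/device/timeout

Note that on many drives the SCT ERC setting does not survive a power cycle, which is one reason to reapply it automatically at boot or device discovery rather than setting it once by hand.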
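
A udev rule is one way to do that automatically, along the lines Chris mentions. A rough sketch, assuming smartctl lives in /usr/sbin; the rule file path, device match pattern, and values are illustrative, not prescriptive:

  # e.g. /etc/udev/rules.d/60-scterc.rules (path is illustrative)
  # Try to set a 7-second SCT ERC on each whole disk as it appears
  ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", RUN+="/usr/sbin/smartctl -l scterc,70,70 /dev/%k"

  # Alternative for drives that do not support SCT ERC: raise the kernel command timer
  # ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{device/timeout}="180"

After adding the rule, reload udev (udevadm control --reload-rules) and re-trigger or reboot. Which of the two rules a given drive needs depends on whether it accepts SCT ERC, so a mixed array may need both.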