Re: XFS disaster recovery

Thank you so much for your expert consultation on this, Dave. I'm
definitely cognizant of the fact that there may still be corruption
within files even where the metadata is OK. It sounds like we've moved
the situation along as far as we can: through nondestructive testing
with the metadata dump and a sparse image, we found a set of
parameters with which xfs_repair will finish without hanging or
crashing, and we end up with a filesystem that can be mounted.
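
For the archives, the nondestructive test loop was roughly the
following (device and file names here are placeholders, and the
xfs_repair options that get a run to complete will vary case by
case):

    # Capture metadata only from the damaged filesystem (no file data)
    xfs_metadump -g /dev/md0 /scratch/fs.metadump

    # Restore the metadump into a sparse image file
    xfs_mdrestore /scratch/fs.metadump /scratch/fs.img

    # Repair the image; -P (disable prefetching) is what let the
    # run complete in our case
    xfs_repair -P /scratch/fs.img

    # Confirm the repaired image mounts via loopback and take stock
    mount -o loop,ro /scratch/fs.img /mnt/test
    du -sh /mnt/test/lost+found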

We'll move ahead with repairing the filesystem on-disk and copy off
what we can, with the caveat that users will want to go back and
verify file integrity once the copies are finished, since there may
be additional data loss that isn't captured by what's in lost+found.
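
For the copy-off step, we'll likely do something along these lines (a
rough sketch; the rsync options and paths are only examples, and the
checksum list is only useful where users kept older checksums or
copies to compare against):

    # Copy off what we can, preserving attributes and logging errors
    rsync -aHAX --partial /mnt/repaired/ /mnt/new-storage/ \
        2> copy-errors.log

    # Record checksums of everything copied, so users can compare
    # against any checksums or copies they kept elsewhere
    cd /mnt/new-storage
    find . -type f -print0 | xargs -0 sha256sum > copied.sha256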

Best,

Sean


On Tue, Feb 8, 2022 at 3:56 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Tue, Feb 08, 2022 at 10:46:45AM -0500, Sean Caron wrote:
> > Hi Dave,
> >
> > I'm sorry for some imprecise language. The array is around 450 TB
> > raw, and I refer to it as roughly half a petabyte, but factoring
> > out RAID parity disks and spare disks it should indeed be around
> > 384 TB formatted.
>
> Ah, OK, looks like it was a complete dump, then.
>
> > I found that if I ran the development-tree xfs_repair with the -P
> > option, I could get xfs_repair to complete a run. It exits with
> > return code 130, but the resulting loopback image filesystem is
> > mountable, and I see around 27 TB in lost+found, which would
> > represent around 9% loss relative to what was actually on the
> > filesystem.
>
> I'm sure that if that much ended up in lost+found, xfs_repair also
> threw away a whole load of metadata, which means data will have been
> lost. And with this much metadata corruption occurring, it tends to
> imply that there will be widespread data corruption, too.  Hence I
> think it's worth pointing out (maybe unnecessarily!) that xfs_repair
> doesn't tell you about (or fix) data corruption - it just rebuilds
> the metadata back into a consistent state.
>
> > Given where we started, I think this is acceptable (more than
> > acceptable, IMO; I was getting to the point of expecting to have to
> > write off the majority of the filesystem), and it seems like a way
> > forward to get the majority of the data off this old filesystem.
>
> Yes, but you are still going to have to verify that the data you can
> still access is not corrupted - random offsets within files could
> now contain garbage regardless of whether the file was moved to
> lost+found or not.
>
> > Is there anything further I should check or any caveats that I should
> > bear in mind applying this xfs_repair to the real filesystem? Or does
> > it seem reasonable to go ahead, repair this and start copying off?
>
> Seems reasonable to repeat the process on the real filesystem, but
> given the caveat about data corruption above, I suspect that the
> entire dataset on the filesystem might still end up being a complete
> write-off.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx


