On 9/9/2014 5:24 PM, Eric Sandeen wrote:
> On 9/9/14 11:03 AM, Sean Caron wrote:
>> Barring rare cases, xfs_repair is bad juju.
> No, it's not. It is the appropriate tool to use for filesystem repair.
> But it is not the appropriate tool for recovery from mangled storage.
It's not all that mangled. Out of over 52,000 files on the backup
server array, only 5758 were missing from the primary array, and most of
those were lost by the corruption of just a couple of directories, where
every file in the directory was lost with the directory itself. Several
directories and a scattering of individual files were deleted with
intent prior to the failure but not yet purged from the backup. Most
were small files - only 29 were larger than 1G. All of those 5758 were
easily recovered. The only ones remaining at issue are 3 files which
cannot be read, written or deleted. The rest have been read and
checksums successfully computed and compared. With only 50K files in
question, I am confident any checksum collisions are of insignificant
probability. Someone is going to have to do a lot of talking to
convince me that rsync can read two copies of what should be the same
data and come up with the same checksum value for both, yet other
applications would be able to successfully read one of the files and
not the other.
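
For anyone who wants to repeat this kind of check without rsync, the comparison can be sketched roughly as below. This is only an illustration, not what rsync does internally: the function names (file_digest, compare_trees, false_match_bound) are made up for the example, SHA-256 is an arbitrary choice of digest, and the collision bound assumes the digest behaves like a uniformly random value.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, or None if it cannot be read."""
    h = hashlib.sha256()
    try:
        with open(path, "rb") as f:
            # Read in chunks so multi-gigabyte files do not exhaust memory.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
    except OSError:
        # An I/O error here is how a file damaged on disk would show up.
        return None
    return h.hexdigest()


def compare_trees(backup_root, primary_root):
    """Classify every file under backup_root against primary_root."""
    report = {"missing": [], "unreadable": [], "mismatch": [], "ok": []}
    for dirpath, _dirs, files in os.walk(backup_root):
        for name in files:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, backup_root)
            dst = os.path.join(primary_root, rel)
            if not os.path.lexists(dst):
                report["missing"].append(rel)
                continue
            a, b = file_digest(src), file_digest(dst)
            if a is None or b is None:
                report["unreadable"].append(rel)
            elif a != b:
                report["mismatch"].append(rel)
            else:
                report["ok"].append(rel)
    return report


def false_match_bound(n_files, digest_bits=128):
    """Union bound on the chance that any one of n_files differing pairs
    still yields equal digests (assumes uniformly random digests)."""
    return n_files / 2 ** digest_bits
```

Calling compare_trees("/backup", "/primary") (both paths hypothetical) gives four lists of relative paths. With false_match_bound(52000) at a 128-bit digest width, which is the width of an MD5 digest, the bound is far below anything worth worrying about, which is the point being made above.
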
I really don't think Draconian measures are required. Even if it turns
out they are, the existence of the backup allows for a good deal of
fiddling with the main filesystem before one is compelled to give up and
start fresh. This is especially so since a small amount of the data on the
main array had not yet been backed up to the secondary array. These
e-mails, for example. The rsync job that backs up the main array runs
every morning at 04:00, so files created that day were not backed up,
and for safety I have changed the backup array file system to read-only,
so nothing created since is backed up.
> I've actually been running a filesystem fuzzer over xfs images, randomly
> corrupting data and testing repair, 1000s of times over. It does
> remarkably well.
> If you scramble your raid, which means your block device is no longer
> an xfs filesystem, but is instead a random tangle of bits and pieces of
> other things, of course xfs_repair won't do well, but it's not the right
> tool for the job at that stage.
This is nowhere near that stage. A few sectors here and there were
lost because 3 drives were kicked from the array while write operations
were underway. I had to force re-assemble the array, which lost some
data. The vast majority of the data is clearly intact, including most
of the file system structures. Far less than 1% of the data was lost or
corrupted.