On Fri, Mar 21, 2008 at 06:35:43PM +0100, Peter Rabbitson wrote:
> Of course it would be possible to instruct md to always read all
> data+parity chunks and make a comparison on every read. The performance
> would not be much to write home about though.

Yeah, and that's probably the real problem with this scheme.  You
basically reduce the read bandwidth of your array down to that of a
single (slowest) disk --- basically the same reason why RAID-2 was a
commercial failure.

I suspect the best thing we *can* do, for filesystems that include
checksums in the metadata and/or the data blocks, is to have the
filesystem tell the RAID subsystem when a CRC doesn't match: "um,
could you send me copies of the data from all of the RAID-1 mirrors,
and see if one of the copies from the mirrors yields a valid
checksum?"  Something similar could be done with RAID-5/RAID-6
arrays, if the fs layer could ask the RAID subsystem, "the external
checksum for this block is bad; can you recalculate it from the
available parity, assuming the data stripe is invalid?"  (A rough
sketch of what such an interface might look like is appended below.)

Ext4 has metadata checksums; U Wisconsin's Iron filesystem (sponsored
with a grant from EMC) did it for both data and metadata, if memory
serves me correctly.  ZFS smashed through the RAID abstraction
barrier and sucked RAID functionality up into the filesystem so it
could do this sort of thing; but with the right new set of
interfaces, it should be possible to add this functionality without
reimplementing RAID in each filesystem.

As far as the question of how often this happens --- a disk silently
corrupting a block without returning a media error --- it definitely
happens.  Larry McVoy tells a story of periodically running a
per-file CRC across a backup/archival filesystem, and being able to
detect files that had not been modified changing out from under him.
One way this can happen is if the disk accidentally writes some block
to the wrong location on disk; the blockguard extension and various
enterprise databases (since they can control their db-specific
on-disk format) encode the intended location of a block in their
per-block checksums specifically to detect this type of failure,
which should be a broad hint that this sort of thing can and does
happen.

Does it happen as much as ZFS's marketing literature implies?
Probably not.  But as you start making bigger and bigger filesystems,
the chances that even relatively improbable errors happen start
increasing significantly.

Of course, the flip side of the argument is that if you are using
huge arrays to store things like music and video, maybe you don't
care about a small amount of data corruption, since it might not be
noticeable to the human eye/ear.  That's a pretty weak argument,
though, and it sends shivers up the spines of people who are storing,
for example, medical images from X-rays or CAT scans.

						- Ted
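
To make the mirror-retry idea above a bit more concrete, here is a
small, self-contained userspace C model.  Nothing in it corresponds
to a real md or kernel interface; the names (mirror_read(),
checked_read(), NR_MIRRORS) are invented purely for illustration.
The point is only the control flow: verify the block against its
expected checksum, and on a mismatch ask for the copy held by each
remaining mirror until one of them checks out.

/*
 * Toy userspace model of the "retry against the other mirrors" idea.
 * None of these names are a real md or kernel interface.
 *
 * Build: cc -o mirror-retry mirror-retry.c
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define NR_MIRRORS 3
#define BLOCK_SIZE 16

/* Pretend RAID-1 array: NR_MIRRORS copies of a single block. */
static uint8_t mirrors[NR_MIRRORS][BLOCK_SIZE];

/* "Read" the block from one specific mirror. */
static void mirror_read(int mirror, uint8_t *out)
{
	memcpy(out, mirrors[mirror], BLOCK_SIZE);
}

/* Minimal bitwise CRC-32 (reflected, polynomial 0xEDB88320). */
static uint32_t crc32_buf(const uint8_t *buf, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;
	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
	}
	return ~crc;
}

/*
 * What the filesystem would like to be able to do: read the block,
 * and if the stored checksum doesn't match, ask the RAID layer for
 * the copy held by each remaining mirror until one checks out.
 */
static int checked_read(uint32_t expected_crc, uint8_t *out)
{
	for (int m = 0; m < NR_MIRRORS; m++) {
		mirror_read(m, out);
		if (crc32_buf(out, BLOCK_SIZE) == expected_crc) {
			if (m > 0)
				printf("recovered from mirror %d\n", m);
			return 0;
		}
	}
	return -1;	/* no copy matched the checksum */
}

int main(void)
{
	uint8_t good[BLOCK_SIZE] = "important data!";
	uint32_t expected = crc32_buf(good, BLOCK_SIZE);
	uint8_t buf[BLOCK_SIZE];

	/* Every mirror holds the block, but mirror 0 is silently corrupt. */
	for (int m = 0; m < NR_MIRRORS; m++)
		memcpy(mirrors[m], good, BLOCK_SIZE);
	mirrors[0][3] ^= 0x40;

	if (checked_read(expected, buf) == 0)
		printf("read ok: %s\n", buf);
	else
		printf("unrecoverable: no mirror matched the checksum\n");
	return 0;
}

For RAID-5/RAID-6 the loop would instead ask for the reconstruction
that assumes chunk N is bad, for each N in the stripe; and in either
case a real interface would presumably also want a way to tell the
RAID layer which copy turned out to be good, so it can rewrite the
bad one.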
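
The periodic per-file CRC scan mentioned above is easy to approximate
in userspace (this is only a sketch of the idea, not Larry's actual
tool).  The toy below prints a CRC-32 for each file named on the
command line; run it periodically over an archival tree, save the
output, and diff it against the previous run.  Any file whose CRC
changed even though it was never modified is a candidate for silent
corruption.

/*
 * Toy per-file CRC scanner: prints "crc32  filename" for each argument.
 * Diffing successive runs over an unchanging archive flags files whose
 * contents changed underneath you.
 *
 * Build: cc -o crcscan crcscan.c
 * Use:   find /archive -type f -print0 | xargs -0 ./crcscan > today.crc
 */
#include <stdio.h>
#include <stdint.h>

static uint32_t crc32_update(uint32_t crc, const uint8_t *buf, size_t len)
{
	crc = ~crc;
	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
	}
	return ~crc;
}

int main(int argc, char **argv)
{
	for (int i = 1; i < argc; i++) {
		FILE *f = fopen(argv[i], "rb");
		if (!f) {
			perror(argv[i]);
			continue;
		}
		uint8_t buf[65536];
		size_t n;
		uint32_t crc = 0;
		while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
			crc = crc32_update(crc, buf, n);
		fclose(f);
		printf("%08x  %s\n", crc, argv[i]);
	}
	return 0;
}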
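
On the "block written to the wrong location" failure mode, the trick
of folding the intended location into the per-block checksum is
simple to demonstrate.  The layout below is invented for this example
(it is not the actual T10 DIF/blockguard on-disk format): the
checksum covers both the data and the LBA the block was supposed to
land on, so a block that arrives intact but at the wrong address
still fails verification.

/*
 * Toy illustration of encoding a block's intended location in its
 * checksum.  The layout is made up; not the real blockguard format.
 *
 * Build: cc -o lbatag lbatag.c
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 512

struct tagged_block {
	uint8_t  data[BLOCK_SIZE];
	uint64_t lba;	/* where this block was *meant* to be written */
	uint32_t crc;	/* checksum over data + intended lba */
};

static uint32_t crc32_update(uint32_t crc, const uint8_t *buf, size_t len)
{
	crc = ~crc;
	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
	}
	return ~crc;
}

/* Seal a block that is about to be written to logical block 'lba'. */
static void seal_block(struct tagged_block *b, uint64_t lba)
{
	b->lba = lba;
	b->crc = crc32_update(0, b->data, BLOCK_SIZE);
	b->crc = crc32_update(b->crc, (uint8_t *)&b->lba, sizeof(b->lba));
}

/*
 * Verify a block read back from logical block 'lba'.  A misdirected
 * write shows up as a location mismatch even though the data and its
 * checksum are internally consistent.
 */
static int verify_block(const struct tagged_block *b, uint64_t lba)
{
	uint32_t crc = crc32_update(0, b->data, BLOCK_SIZE);
	crc = crc32_update(crc, (const uint8_t *)&b->lba, sizeof(b->lba));

	if (crc != b->crc)
		return -1;	/* data or tag corrupted in place */
	if (b->lba != lba)
		return -2;	/* intact block, but at the wrong address */
	return 0;
}

int main(void)
{
	struct tagged_block b;

	memset(b.data, 0xAB, sizeof(b.data));
	seal_block(&b, 1000);	/* block intended for LBA 1000 */

	/* 0: block read back where it belongs */
	printf("verify at LBA 1000: %d\n", verify_block(&b, 1000));

	/* -2: same bits, but the drive put them at the wrong address */
	printf("verify at LBA 2000: %d\n", verify_block(&b, 2000));
	return 0;
}

This is also why the enterprise databases mentioned above can catch
misdirected writes without help from the storage stack: the reader
already knows which address it asked for, so checking the tag is
essentially free.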