Theodore Tso wrote:
On Thu, Mar 20, 2008 at 06:39:06PM +0100, Andre Noll wrote:
On 12:35, Theodore Tso wrote:
If a mismatch is detected in a RAID-6 configuration, it should be
possible to figure out what should be fixed
It can be figured out under the assumption that exactly one drive has
bad data and all other ones have good data. But that seems to be an
assumption that is hard to verify in reality.
True, but it's what ECC memory does. :-) And most people agree that
it's a useful thing to do with memory.
If you do ECC syndrome checking on every read, and follow that up with
periodic scrubbing so that you catch (and correct) errors quickly, it
is a reasonable assumption to make.
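
[A minimal sketch of the single-bad-drive case discussed above, not the md implementation. It assumes the usual RAID-6 construction over GF(2^8) with polynomial 0x11d and generator 2, and for brevity treats each drive as holding one byte. The names parity() and locate_single_error() are hypothetical.]

```python
# GF(2^8) log/antilog tables, polynomial 0x11d, generator g = 2 (assumed).
EXP = [0] * 510
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
for i in range(255, 510):
    EXP[i] = EXP[i - 255]  # duplicate so gf_mul needs no modulo


def gf_mul(a, b):
    """Multiply two field elements using the log tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def parity(data):
    """Return (P, Q) for one byte per data drive: P = xor D_i, Q = xor g^i * D_i."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(EXP[i], d)
    return p, q


def locate_single_error(data, p, q):
    """Assume at most one drive is wrong and report which one.

    Returns (kind, index, repaired_value) where kind is one of
    "ok", "P", "Q", "data" or "unknown".
    """
    cp, cq = parity(data)
    sp, sq = p ^ cp, q ^ cq          # the two syndromes
    if sp == 0 and sq == 0:
        return ("ok", None, None)
    if sq == 0:
        return ("P", None, cp)       # only P disagrees: P itself is bad
    if sp == 0:
        return ("Q", None, cq)       # only Q disagrees: Q itself is bad
    # Data drive z has error E: sp = E and sq = g^z * E, so z = log(sq) - log(sp).
    z = (LOG[sq] - LOG[sp]) % 255
    if z < len(data):
        return ("data", z, data[z] ^ sp)
    return ("unknown", None, None)   # not explainable by a single bad drive
```

[If two or more drives hold bad data, this can return "unknown", but it can also point at the wrong drive, which is exactly the unverifiable assumption raised earlier in the thread.]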
Obviously a warning should be given when you do these kinds of ECC
fixups, and if the number of ECC fixups being done keeps increasing,
that should set off alarms that maybe there is a hardware problem
that needs to be addressed.
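
[A rough sketch of the "increasing number of fixups" alarm, under the assumption that fixups are counted once per scrub pass. FixupMonitor and its window parameter are hypothetical and not part of any existing tool.]

```python
from collections import deque


class FixupMonitor:
    """Track fixups per scrub pass and flag a strictly rising trend."""

    def __init__(self, window=3):
        # Keep only the most recent `window` per-pass counts.
        self.history = deque(maxlen=window)

    def record_pass(self, fixups):
        """Record one scrub pass; return True if the alarm should fire."""
        self.history.append(fixups)
        return self.rising()

    def rising(self):
        """True once the window is full and every pass had more fixups than the last."""
        h = list(self.history)
        if len(h) < self.history.maxlen:
            return False
        return all(a < b for a, b in zip(h, h[1:]))
```

[A real policy would probably also warn on every individual fixup, as suggested above; this only covers the trend check.]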
Regards,
- Ted
This might have been stated before in the thread, but most RAID
rebuilds are triggered by easily identified drive failures (i.e., a
completely dead drive or a sequence of bad sectors that generates an IO
error as we read from the platter). Fortunately, these are also the most
common failures in RAID boxes ;-)
The way you deal with the class of errors that don't trigger obvious
failures is to do some kind of background scrubbing or add extra
protection data to the disk.
Martin Petersen presented the new "DIF" work at the FS/IO workshop. This
might be an interesting feature to build into MD raid devices:
http://oss.oracle.com/projects/data-integrity/documentation/
You would need to reformat your drives, so this is not a generic
solution for all users, but it really does address the core of the issue.
ric