Quick question:
I've been running a large ext3 filesystem on an LVM volume group built
from multiple Linux /dev/mdX RAID5 arrays. Recently, while doing full
identical rewrites of every bit of data (literally), I've started to
hit cases where the server locks up or reboots, and the culprit has
each time been traced to a first failure on one of the ATA drives
reporting a bad CRC. Replacing the single bad drive fixes the issue.
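(For what it's worth, the obvious early-warning check that occurred to
me is watching the SMART attributes for CRC, reallocated, and
pending-sector counts, e.g., assuming smartmontools is installed and
/dev/hda stands in for a suspect drive:

    smartctl -A /dev/hda | egrep -i 'crc|realloc|pending'

Not sure whether that is sufficient, hence the questions below.)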
My best guess is this: the filesystem is built on the LVM volume,
which is composed of extents. The extents reside on physical volumes.
The physical volumes are developing uncorrectable errors through
natural use, time, heat, or secret alien plot. These silent failures
sit dormant until I try to access those areas of those drives, at
which point a big catastrophic failure occurs, incurring downtime,
potential data loss, and expense.
How can I (1) prevent this, (2) detect it, and (3) correct it without
tossing a whole drive over a single small bad area?
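For (2), the best I've come up with so far is a periodic surface scan
of each member drive, e.g., assuming /dev/hda is an array member (both
of these are read-only, so my understanding is they should be safe on
a live array):

    # ask the drive firmware to run its own full surface scan
    smartctl -t long /dev/hda
    # ...later, read back the self-test result
    smartctl -l selftest /dev/hda

    # or do a host-driven, read-only scan of every sector
    badblocks -sv /dev/hda

But neither of those tells md anything about what it finds, which
brings me to the real questions: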
Is the md driver smart enough to correct around such physical media
errors? Are there ways, via mdadm or other tools, to actively scan for
such bad areas? (Obviously filesystem-level tools are useless for
this, right?) Can I potentially keep using this "bad" drive by somehow
applying a correction?
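(The one lead I've found is a sysfs knob that sounds like exactly this
kind of scrub; assuming md0 is one of the arrays and the kernel is
recent enough to have it:

    # read every stripe and verify parity across the array
    echo check > /sys/block/md0/md/sync_action
    # watch the scrub's progress
    cat /proc/mdstat
    # count of inconsistencies found
    cat /sys/block/md0/md/mismatch_cnt

    # or let md rewrite whatever fails to read or verify
    echo repair > /sys/block/md0/md/sync_action

My understanding is that the rewrite from 'repair' (or any write to a
pending bad sector, even a raw
'dd if=/dev/zero of=/dev/hda bs=512 seek=<LBA> count=1', which
destroys that sector's contents) should make the drive remap the
sector from its spare pool, so the drive stays usable. Is that the
intended mechanism, or am I misreading it?)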
Regards-
Michael Stumpf