Re: bit-rot, crc errors, etc question

Meant to send this to the list as well; I just sent it to Michael Stumpf
the first time. It's generally applicable, though.

Any other / better thoughts very welcome...

-Mike

-------- Original Message --------
Subject: Re: bit-rot, crc errors, etc question
Date: Thu, 06 Oct 2005 11:19:59 -0700
From: Mike Hardy <mhardy@xxxxxxx>
To: mjstumpf@xxxxxxxxx
References: <43455064.8020102@xxxxxxxxx>


Assuming you're running PATA, use smartd to schedule staggered daily
short tests and weekly extended tests of all drives.
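
Something like this in /etc/smartd.conf does it; the device names and
times here are just examples, and the -s regexp fields are
type/month/day-of-month/day-of-week/hour:

    # /etc/smartd.conf -- illustrative device names and schedules
    # short self-test daily at 2am, extended test Saturdays at 3am
    /dev/hda -a -d ata -m root -s (S/../.././02|L/../../6/03)
    # second drive staggered: short tests at 3am, extended Sundays at 4am
    /dev/hdc -a -d ata -m root -s (S/../.././03|L/../../7/04)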

If you install smartmontools, you even get a nifty logwatch script that
digests all the disk stats and puts them in the daily maintenance emails.

You'll generally see a progression from soft read failures, to
ECC-recovered errors, to unrecoverable block errors.

When that happens you just fail the drive, use dd to directly plink the
bad blocks so the drive internals relocate them, use dd again to read
from the blocks to verify they're gone, then re-add the disk. I'd add
that it's not a bad idea to put the affected array in read-only mode
while redundancy is lost, unless you're using raid6.
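
Roughly, with the array, partition, and sector all placeholders here
(BAD_LBA comes from the SMART self-test log):

    # fail and remove the member that logged the error
    mdadm /dev/md0 --fail /dev/hdc1 --remove /dev/hdc1

    # optionally hold the degraded array read-only until redundancy returns
    mdadm --readonly /dev/md0

    # write over the bad sector so the drive firmware remaps it
    # (seek= counts in bs-sized blocks; this clobbers that sector's data,
    #  which the raid resync rebuilds anyway)
    dd if=/dev/zero of=/dev/hdc bs=512 count=1 seek=BAD_LBA

    # read it back to confirm the remap took
    dd if=/dev/hdc of=/dev/null bs=512 count=1 skip=BAD_LBA

    # re-add the member and let md resync it from the surviving drives
    mdadm --readwrite /dev/md0
    mdadm /dev/md0 --add /dev/hdc1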

Not much muss, not much fuss.

If you're using SATA, lobby for SMART-over-SATA support to be included
in the mainline kernels, possibly by helping to test it.

Alternatively, it appears that Neil has just posted a bunch of patches
that enable full raid5 parity scans. That would be nearly as good as
smartd, except it won't tell you drive temperature or alien plot details
the way smartd does :-)
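
I haven't tried those patches, but if they follow the sysfs style md
already uses, triggering a scan would presumably look something like
this (treat the exact file names as a guess until the patches land):

    # kick off a full parity check of md0 (sysfs path assumed)
    echo check > /sys/block/md0/md/sync_action

    # watch progress the usual way
    cat /proc/mdstat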

Googling for "BadBlockHowTo" will lead to more info as well. In general,
bad blocks are expected and not hard to recover from. It's all about
proactive detection and quick recovery so redundancy is maintained as
much as possible.
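
For spot checks between smartd reports, the same data is available by
hand (device name illustrative):

    smartctl -H /dev/hda           # overall health assessment
    smartctl -A /dev/hda           # attributes: watch Reallocated_Sector_Ct
                                   # and Current_Pending_Sector creep upward
    smartctl -l selftest /dev/hda  # self-test log, including the LBA of
                                   # the first error (the BAD_LBA above)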

-Mike

Michael Stumpf wrote:
> Quick question:
> 
> Been running a large ext3 filesystem on an LVM set with multiple linux
> /dev/mdX raid5 arrays underneath.  Recently, upon trying to do full
> identical rewrites of every bit (literally) of data, I'm starting to
> find cases where the server locks up/reboots, and the culprit seems to
> trace back to the first failure of one of the ATA drives reporting a
> bad CRC.  Replacing the single bad drive fixes the issue.
> 
> My best guess is this:  the filesystem is built on the LVM, composed of
> extents.  The extents reside on physical volumes.  The physical volumes
> are developing uncorrectable errors through natural use/time/heat/secret
> alien plot.  These silent failures sit around until I try to access
> those pieces of those drives, at which point big catastrophic failures
> occur, incurring downtime, potential data loss, and expense.
> 
> How can I 1) prevent this,  2) detect this,  3) correct this without
> tossing the drive for a single small bad area?
> 
> Is the md driver set smart enough to correct around such physical media
> errors?  Are there ways via mdadm/other tools to actively scan for such
> bad areas (obviously in this case filesystem tools to do this are
> useless, right)?  Can I potentially continue using this "bad" drive by
> somehow applying a correction?
> 
> Regards-
> Michael Stumpf