Re: data corruption - the nightmare continues

Jakob Østergaard <jakob@unthought.net> · Wed, 20 Mar 2002 16:31:04 +0100

On Wed, Mar 20, 2002 at 09:38:44AM -0500, Justin wrote:
> As I recall, at least in my U_ situations, when an array
> goes U_, the 'failed' disk is no longer addressable at all,
> until a reboot.. but next time it happens I'll try after
> reboot reading the entire surface before re-writing it
> to see if that picks up any errors.

Ok, cool.

> I could see how a read would fail, until a disk was told
> to write, then the whole surface would work again.. if this
> is common behavior for disks would that perhaps be
> something the raid code could recognize and work around?

In my experience it has been fairly common.

But what workaround would you put into the MD code?  Just write a zero block
to the bad sector, and "gracefully" ignore the bad block (leaving the
filesystem with a zeroed-out hole)?  No, the correct action is to kick the
disk (IMO).

I've been thinking about doing things like nightly "scans" of the underlying
disks - but that kind of code is much easier done in userspace (where it
belongs).  Then, you'd have a failed disk in the morning, which is better
than suddenly having a failed disk in a RAID-5 and then losing the entire
array when a second disk fails during the re-sync.
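
A minimal sketch of what such a userspace scan could look like, in Python.
This is only an illustration, not tested against real failing hardware.  The
names scan_device() and main() and the 1 MiB chunk size are my own choices.
It reads each device from start to end and records the offsets where a read
fails.  Note the assumption that reads actually hit the disk: blocks already
in the page cache may be served from memory, so a serious version would
probably need O_DIRECT or similar.

```python
import os

CHUNK = 1024 * 1024  # read size in bytes (arbitrary choice for this sketch)


def scan_device(path, chunk=CHUNK):
    """Read `path` from start to end.

    Returns (bytes_read, bad_offsets), where bad_offsets lists the start
    offset of every chunk whose read raised an I/O error.
    """
    bad_offsets = []
    bytes_read = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size for regular files and,
        # on Linux, for block devices as well.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            want = min(chunk, size - offset)
            try:
                data = os.pread(fd, want, offset)
            except OSError:
                # Unreadable region: note it and skip past the chunk.
                bad_offsets.append(offset)
                offset += want
                continue
            if not data:
                break  # unexpected end of device
            bytes_read += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return bytes_read, bad_offsets


def main(paths):
    """Scan every path; return 1 if any read error was seen, else 0."""
    status = 0
    for path in paths:
        bytes_read, bad_offsets = scan_device(path)
        print("%s: read %d bytes, %d bad chunk(s)"
              % (path, bytes_read, len(bad_offsets)))
        for offset in bad_offsets:
            print("  read error at byte offset %d" % offset)
        if bad_offsets:
            status = 1
    return status
```

Run nightly from cron as root with something like
sys.exit(main(sys.argv[1:])) and the member devices as arguments (the device
names are whatever your array is built from).  A non-zero exit status is then
the cue to look at the disk before the array does it for you.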

-- 
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............: