RE: Distributed spares

"David Lethe" <david@xxxxxxxxxxxx> · Tue, 14 Oct 2008 08:20:50 -0500

> -----Original Message-----
> From: linux-raid-owner@xxxxxxxxxxxxxxx
> [mailto:linux-raid-owner@xxxxxxxxxxxxxxx] On Behalf Of Martin K. Petersen
> Sent: Tuesday, October 14, 2008 5:12 AM
> To: Keld Jørn Simonsen
> Cc: Billy Crook; Justin Piszcz; Bill Davidsen; Neil Brown; Linux RAID
> Subject: Re: Distributed spares
> 
> >>>>> "Keld" == Keld Jørn Simonsen <keld@xxxxxxxx> writes:
> 
> Keld> I have also been thinking a little on this. My idea is that if
> Keld> bit errors develop on disks, then there is first maybe one bit
> Keld> error, and the crc check on the disk sectors then finds and
> Keld> corrects these.
> 
> Keld> If you rewrite such bit errors, then that bit error will be
> Keld> corrected, and you prevent the one-bit error from developing to
> Keld> a two-bit error that is not correctable by the CRC.
> 
> I think you are assuming that disks are much simpler than they
> actually are.
> 
> A modern disk drive protects a 512-byte sector with a pretty strong
> ECC that's capable of correcting errors up to ~50 bytes.  Yes, that's
> bytes.
> 
> Also, many drive firmwares will internally keep track of problematic
> media areas and rewrite or reallocate affected blocks.  That includes
> stuff like rewriting sectors that are susceptible to bleed due to
> being adjacent to write hot spots.
> 
> --
> Martin K. Petersen	Oracle Linux Engineering
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Martin is absolutely correct.  Enterprise-class drives have come a long way.  They scan and
repair blocks in the background (though certainly not 100% of them).  Even the $99 disk drives
you get at the local computer retailer now have limited BGMS (background media scan) and
repair capability.

If you run a drive's built-in diagnostics, you can be presented with a list of known bad blocks,
and sometimes you get a bad block display in POST when the system boots.

How about a baby step?  When you run offline or online tests, or even media scans, you get
a list of known defects.  How about a program that rewrites a RAID1/3/5/6 stripe, where you
just pass it the physical device name and the known bad block number?
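
A minimal sketch, in Python, of the first step such a program would need: turning a
defect's sector number on the physical member into the chunk-aligned window that contains
it.  The function name, the data_offset parameter (where array data starts on the member)
and the chunk size are assumptions for illustration only; reading the real layout from the
array and rewriting the stripe from redundancy are not shown.

```python
def stripe_window(member_sector, data_offset, chunk_sectors):
    """Return (start, end) of the chunk holding a bad sector.

    All values are in sectors. start/end are relative to the beginning
    of the member's data area, and end is exclusive.
    Hypothetical helper: real layouts must be read from the array itself.
    """
    if chunk_sectors <= 0:
        raise ValueError("chunk_sectors must be positive")
    if member_sector < data_offset:
        raise ValueError("sector lies before the data area (metadata?)")

    relative = member_sector - data_offset          # position inside the data area
    start = (relative // chunk_sectors) * chunk_sectors  # round down to chunk boundary
    return start, start + chunk_sectors


if __name__ == "__main__":
    # Example numbers only: bad sector 2056, data starts at 2048, 64 KiB chunks.
    print(stripe_window(2056, 2048, 128))
```

The window is what a stripe rewrite would have to cover on every member, since the parity
for that chunk lives at the same offset on the other disks.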

As for checking out a disk ...

The prior poster's idea of putting the RAID in degraded mode in order to check out a disk is,
frankly, nuts.  NEVER degrade anything.  Just use the hot spare: do a hot clone of the disk in
question onto the hot spare, then make that disk the new hot spare and repeat.

Think of this as a "Rotating the Tires" mode.
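
To make the rotation concrete, here is a small planning sketch in Python.  It only computes
the order of clone operations; it does not touch any devices, and the function name and
device names are made up for illustration.

```python
def rotation_plan(active, spare):
    """Plan a full "rotate the tires" pass over an array.

    active: list of member disk names, one per slot.
    spare:  name of the current hot spare.
    Returns (steps, new_active, new_spare), where each step is a
    (source, target) pair meaning "hot clone source onto target".
    The array is never degraded: a member only leaves its slot after
    its clone on the spare has taken over.
    """
    active = list(active)  # work on a copy
    steps = []
    for slot in range(len(active)):
        source = active[slot]
        steps.append((source, spare))  # clone member onto the current spare
        active[slot] = spare           # the clone takes over the slot
        spare = source                 # the old member is checked out, then becomes the spare
    return steps, active, spare


if __name__ == "__main__":
    steps, members, spare = rotation_plan(["sda", "sdb", "sdc"], "sdd")
    for source, target in steps:
        print(f"clone {source} -> {target}, then test {source}")
    print("members:", members, "spare:", spare)
```

After one full pass every original member has spent a turn out of the array as the spare,
which is when it can be tested at leisure.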

David @ santools com
http://www.santools.com/smart/unix/manual
