Re: md RAID with enterprise-class SATA or SAS drives

You quoted me so I'll reply to this. Consider that most people use
drives with NO limit, and that the 7 second limit is standard only
because most RAID cards will freak and start sending resets at around
8 seconds. The danger you discuss is just as prevalent with a 7 second
limit; if a drive is repeatedly having to do *any* read correction
then it should be replaced, but that's a separate discussion on
monitoring. However, the notion that a drive routinely doing error
correction within 5 seconds keeps you safer during a rebuild than one
that routinely takes 11 seconds is spurious.
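
For anyone who wants to experiment with the limit: smartctl takes SCT ERC
values in tenths of a second, so 7 seconds is 70 and the 25 seconds
floated below is 250. A rough Python sketch that only builds the command
line and does not run it (the helper name and /dev/sdX are placeholders
of mine, and not every drive supports SCT ERC):

```python
def scterc_command(device, read_seconds, write_seconds):
    """Build (but do not run) a smartctl command that sets SCT ERC timeouts."""
    # smartctl's "-l scterc,READ,WRITE" option takes deciseconds (100 ms units).
    read_ds = round(read_seconds * 10)
    write_ds = round(write_seconds * 10)
    if read_ds < 0 or write_ds < 0:
        raise ValueError("timeouts must be non-negative")
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]


if __name__ == "__main__":
    # /dev/sdX is a placeholder; substitute a real array member.
    print(" ".join(scterc_command("/dev/sdX", 7, 7)))
    print(" ".join(scterc_command("/dev/sdX", 25, 25)))
```

The printed command needs root to run, and on many drives the setting
does not survive a power cycle, so it usually has to be reapplied at boot.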

I agree that your assertion that "enterprise users don't use md RAID"
is false. Then again, perhaps we should only define enterprises as those
who don't use software RAID.

Regarding something someone else mentioned: as far as I'm aware, md
RAID kicks drives out based on a read error rate, not only on write
errors. This has been the case since 2.6.33, and in the patched
RHEL/CentOS 6 kernels. See drivers/md/md.c,
"#define MD_DEFAULT_MAX_CORRECTED_READ_ERRORS 20".
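
If you want to see what threshold a given array is using, I believe the
same value is exposed per array in sysfs as md/max_read_errors. A small
sketch, assuming that attribute name and the usual /sys/block layout
(the function name and md0 are just examples of mine):

```python
from pathlib import Path


def read_max_read_errors(md_name="md0", sysfs_root="/sys/block"):
    """Return the array's corrected-read-error threshold, or None if unavailable."""
    # Assumed location: /sys/block/<md_name>/md/max_read_errors
    path = Path(sysfs_root) / md_name / "md" / "max_read_errors"
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, NotADirectoryError, PermissionError, ValueError):
        # Missing array, older kernel, insufficient rights, or unexpected content.
        return None


if __name__ == "__main__":
    print(read_max_read_errors("md0"))
```

On a kernel without the attribute, or a box with no md0, this just
returns None rather than raising.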

On Thu, May 10, 2012 at 3:43 PM, Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx> wrote:
> On 5/10/2012 10:26 AM, Marcus Sorensen wrote:
>
>> * Using smartctl to increase the ERC timeout on enterprise SATA
>> drives, say to 25 seconds, for use with md. I have no idea if this
>> will cause the drive to actually try different methods of recovery,
>> but it could be a good middle ground.
>
> If a drive needs 25 seconds to recover from a read error it should have
> been replaced long ago.
>
> The only thing that increasing these timeouts to silly high numbers does
> is, hopefully for those doing it anyway, prolong the replacement
> interval of failing drives.
>
> Can anyone guess what the big bear trap is that this places before you?
>  The rest of the drives in the array have been held over much longer as
> well.  So when you go to finally rebuild the replacement for this 25s
> delay king, you'll be more likely to run into unrecoverable errors on
> other array members.  Then you chance losing your entire array, and, for
> many here, all of your data, as hobbyists don't do backups. ;)
>
> First 2 rules of managing RAID systems:
>
> 1.  Monitor drives and preemptively replace those going downhill BEFORE
> your RAID controllers or md raid kick them
>
> 1a. Don't wait for controllers/md raid to kick bad drives
>
> 2.  Data is always worth more than disk drives
>
> 2a. If drives cost more than your lost data, you're doing it wrong
>
> --
> Stan