Re: md RAID with enterprise-class SATA or SAS drives

Phil Turmel <philip@xxxxxxxxxx> · Thu, 10 May 2012 11:15:10 -0400

On 05/10/2012 10:59 AM, Daniel Pocock wrote:
> 
>> Here is where Marcus and I part ways.  A very common report I see on
>> this mailing list is people who have lost arrays where the drives all
>> appear to be healthy.  Given the large size of today's hard drives,
>> even healthy drives will occasionally have an unrecoverable read error.
>>
>> When this happens in a raid array with a desktop drive without SCTERC,
>> the driver times out and reports an error to MD.  MD proceeds to
>> reconstruct the missing data and tries to write it back to the bad
>> sector.  However, that drive is still trying to read the bad sector and
>> ignores the controller.  The write is immediately rejected.  BOOM!  The
>> *write* error ejects that member from the array.  And you are now
>> degraded.
>>
>> If you don't notice the degraded array right away, you probably won't
>> notice until a URE on another drive pops up.  Once that happens, you
>> can't complete a resync to revive the array.
> 
> What action would you recommend for someone running md on desktop drives
> today?  Can md be configured in some way to avoid such a disaster?

You have to set the controller driver's command timeout greater than the
drive's worst-case error recovery time.  Unfortunately, that's generally
not specified, and therefore only discovered when you have a real URE.
In my experience, it's on the order of two to three minutes.

One thing to keep in mind:  If you set the controller timeout that high,
you may encounter protocol timeouts in your services running on top of
those filesystems.  So it isn't a general solution.

FWIW:  /sys/block/sdX/device/timeout (in seconds, one per drive; the
kernel default is 30)
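
A minimal sketch of raising that timeout, assuming 180 seconds (picked
from the "two to three minutes" above, not a measured figure) and sda/sdb
as stand-ins for the real array members.  SYSFS_ROOT and the
set_cmd_timeout name are hypothetical, there only so the function can be
tried against a scratch directory.  On a real system it needs root, and
the setting does not survive a reboot.

```shell
#!/bin/sh
# Sketch: raise the kernel's per-device command timeout for md members.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# set_cmd_timeout DEV SECONDS
# Writes SECONDS to <sysfs>/block/DEV/device/timeout and prints the
# value read back, or fails if the file is missing or not writable.
set_cmd_timeout() {
    dev="$1"
    secs="$2"
    knob="$SYSFS_ROOT/block/$dev/device/timeout"
    if [ ! -w "$knob" ]; then
        echo "cannot write $knob" >&2
        return 1
    fi
    echo "$secs" > "$knob"
    cat "$knob"
}

# Example (as root), with sda and sdb standing in for the array members:
#   for d in sda sdb; do set_cmd_timeout "$d" 180; done
```

Something like this would have to be rerun at every boot, e.g. from a
local startup script.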

>> Running a "check" or "repair" on an array without TLER will have the
>> opposite of the intended effect: any URE will kick a drive out instead
>> of fixing it.
>>
>> In the same scenario with an enterprise drive, or a drive with SCTERC
>> turned on, the drive read times out before the controller driver, the
>> controller never resets the link to the drive, and the followup write
>> succeeds.  (The sector is either successfully corrected in place, or
>> it is relocated by the drive.)  No BOOM.
> 
> I tend to agree with that approach, and I think that is what Adaptec is
> proposing in their FAQ
> 
> Presumably, if you really do need one of those sectors, the SCTERC
> timeout can be extended (e.g. by disk recovery software) to try harder?

Sure.  SCTERC is set with smartctl (the "-l scterc" option).  If you
need to run dd_rescue or some other recovery tool on a disk, you can
simply set SCTERC back to zero (disabled).  Or cycle power on the drive.
But you would also have to raise the controller's timeout, or it is
pointless:  the driver would give up before the drive does.
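
A rough sketch of the smartctl side.  smartctl takes the read and write
limits in tenths of a second, so the small helper below (scterc_arg, a
hypothetical name) just does that conversion.  The 7 second value and
the 180 second driver timeout in the examples are assumptions, and
/dev/sdX is a placeholder.

```shell
#!/bin/sh
# Sketch: build the smartctl argument for SCT Error Recovery Control.
# smartctl expects the read and write limits in tenths of a second;
# a value of 0 disables the limit.

# scterc_arg SECONDS -> prints "scterc,N,N" with N in tenths of a second
scterc_arg() {
    ds=$(( $1 * 10 ))
    printf 'scterc,%d,%d\n' "$ds" "$ds"
}

# Examples (as root), /dev/sdX being a placeholder:
#   smartctl -l scterc /dev/sdX                 # show the current setting
#   smartctl -l "$(scterc_arg 7)" /dev/sdX      # 7 s limit for RAID use
#   smartctl -l "$(scterc_arg 0)" /dev/sdX      # disable before dd_rescue
#   echo 180 > /sys/block/sdX/device/timeout    # and raise the driver timeout
```

Drives that don't support SCTERC will simply report that when asked, in
which case the long driver timeout is the only option.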

I don't know what you'd do with an enterprise drive that has TLER by
default.

>>>> - if a non-RAID SAS card is used, does it matter which card is chosen?
>>>> Does md work equally well with all of them?
>>>
>>> Yes, I believe md raid would work equally well on all SAS HBAs,
>>> however the cards themselves vary in performance. Some cards that have
>>> simple RAID built-in can be flashed to a dumb card in order to reclaim
>>> more card memory (LSI "IR mode" cards), but the performance gain is
>>> generally minimal
>>
>> Hardware RAID cards usually offer battery-backed write cache, which is
>> very valuable in some applications.  I don't have a need for that kind
>> of performance, so I can't speak to the details.  (Is Stan H.
>> listening?)
> 
> BBWC is not just expensive, it also has an extra management overhead,
> batteries need to have full discharges occasionally (at a time when
> cache is off), routine battery replacement, etc

I haven't had to deal with this :-)

Phil