Re: recommended way to add ssd cache to mdraid array

On 01/11/2013 10:56 PM, Chris Murphy wrote:
> 
> On Jan 11, 2013, at 5:47 PM, Phil Turmel <philip@xxxxxxxxxx> wrote:

[trim /]

>> This isn't quite right.  When the linux driver stack times out, it
>>  passes the error to MD.  MD doesn't care if the drive reported the
>>  error, or if the controller reported the error, it just knows
>> that it couldn't read that block.  It goes to recovery, which
>> typically generates the replacement data in a few milliseconds, and
>> tries to write back to the first disk.  *That* instantly fails,
>> since the controller is resetting the link and the drive is still
>> in la-la land trying to read the data.  MD will tolerate several
>> bad reads before it kicks out a drive, but will immediately kick if
>> a write fails.
>> 
>> By the time you come to investigate, the drive has completed its 
>> timeout, the link has reset, and the otherwise good drive is 
>> sitting idle (failed).
> 
> I admit I omitted the handling of the error md gets in the case of 
> linux itself timing out the drive, because I don't know how that's 
> handled. For example:
> 
> When you say, "the linux driver stack times out, it passes the error 
> to MD," what error is passed? Is it the same (I think it's 0x40) read
> error that the drive would have produced, along with affected LBAs?
> Does the driver know the affected LBAs, maybe by inference? 
> Otherwise md wouldn't know what replacement data to generate. Or is 
> it a different error, neither a read nor write error, that causes md 
> to bounce the drive wholesale?

I haven't examined the code in detail, just watched patches pass on the
list.  :-)  But as I understand it, the error is returned with the
request it belongs to, and MD does not look at the drive's error code
itself.  So MD knows which read failed and on which member device, but
doesn't care whether the error came from the drive itself, the
controller, or the driver.
> 
>> Any array running with mismatched timeouts will kick a drive on 
>> every unrecoverable read error, where it would likely have just 
>> fixed it.
> 
> This is the key phrase I was trying to get at.
> 
>> Sadly, many hobbyist arrays are built with desktop drives, and the
>>  timeouts are left mismatched.  When that hobbyist later learns
>> s/he should be scrubbing, the long-overdue scrub is very likely to
>>  produce UREs on multiple drives (BOOM).
> 
> Or even if they have been scrubbing all along.

Yes.  But in that case, they've probably lost arrays without
understanding why.
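
In case it helps anyone reading along: a scrub is just a write to the
array's sync_action file, so something like the following should work
(md0 here is only an example name):

  # ask MD to read and verify every sector of every member of /dev/md0
  echo check > /sys/block/md0/md/sync_action

  # watch progress
  cat /proc/mdstat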

> If the drive recovers the data inside of 30 seconds, and also doesn't
> relocate the data to a new sector (? I have no idea when drives do
> this on their own; I know they will do it on a write failure but I'm
> unclear when they do it on persistent read "difficulty") the scrub
> has no means of even being aware there's a problem to fix!

I understand that it varies.  But drives generally reallocate only on a
write, and only if a previous URE at that sector has primed them to
verify the write.  Those sectors show up in a smartctl report as
"Pending".

> Given the craptastic state of affairs that manufacturers disallow a 
> simple setting change to ask the drive to do LESS error correction, 
> the recommendation to buy a different drive that can be so 
> configured, is the best suggestion.

For a while, the choice was Hitachi Deskstar or an enterprise drive.
Western Digital's new "Red" series appears to be an attempt to deal
with the backlash.

> Alternative 1 is to change the linux driver timeout to maybe upwards
>  of two minutes, and then deal with the fall out of that behavior, 
> which could be worse than a drive being booted out of the array 
> sooner.

Yes.  Some servers will time out a connection in 90 seconds if a reply
is delayed.  To be safe with desktop drives, a timeout of 120 seconds
seems to be necessary.  I wouldn't be surprised if certain drives needed
more, but I have insufficient experience.
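
For the record, that timeout is per-device in sysfs, so something like
this per member works (sdX is a placeholder; you'd want it in a udev
rule or boot script so it survives reboots and hotplugs):

  # give the drive 120 seconds to respond before the kernel resets the link
  echo 120 > /sys/block/sdX/device/timeout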

> And a very distant alternative 2 is to zero or Secure Erase the drive
> every so often, in hopes of avoiding bad sectors altogether — is
> tedious, as well as implies either putting the drive into a degraded
> state, or cycling a spare drive.

No.  This still isn't safe.  UREs can happen at any time: the common
desktop spec of one unrecoverable error per 10^14 bits read works out
to roughly one URE per 12TB read, even on a freshly wiped drive.
Spares don't help either, as a rebuild onto a spare stresses the rest
of the array, and is likely to expose any developing UREs.

> And at the point you're going to buy a spare drive for this fiasco, 
> you might as well just buy drives suited for the purpose.

Options are:

A) Buy Enterprise drives.  They have appropriate error timeouts and work
properly with MD right out of the box.

B) Buy Desktop drives with SCTERC support.  They have inappropriate
default timeouts, but can be set to an appropriate value.  Udev or boot
script assistance is needed to call smartctl to set it (see the example
after this list).  They do *not* work properly with MD out of the box.

C) Suffer with desktop drives without SCTERC support.  They cannot be
set to appropriate error timeouts.  Udev or boot script assistance is
needed to set a 120 second driver timeout in sysfs.  They do *not* work
properly with MD out of the box.

D) Lose your data during spare rebuild after your first URE.  (Odds in
proportion to array size.)
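
For option B, the ERC timers are read and set with smartctl.  Values
are in tenths of a second, so 70 means 7 seconds, comfortably under the
kernel's default 30 second command timeout.  (sdX is a placeholder, and
the setting is lost on power cycle, which is why the udev or boot
script is needed.)

  # show the current SCT ERC read/write timeouts
  smartctl -l scterc /dev/sdX

  # set both read and write ERC to 7.0 seconds
  smartctl -l scterc,70,70 /dev/sdX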

One last point bears repeating:  MD is *not* a backup system, although
some people leverage its features for rotating off-site backup disks.
RAID arrays are all about *uptime*.  They will not save you from
accidental deletion or other operator errors.  They will not save you if
your office burns down.  You need a separate backup system for critical
files.

Phil

