Re: recommended way to add ssd cache to mdraid array

On Jan 13, 2013, at 3:13 PM, Phil Turmel <philip@xxxxxxxxxx> wrote:
> 
> I haven't examined the code in detail, just watched patches pass on the
> list.  :-)  But as I understand it, the error is returned with the
> request that it belongs to, and the MD does not look at the drive error
> code itself.  So MD knows what read it was, for which member devices, but
> doesn't care if the error came from the drive itself, or the controller,
> or the driver.

I think it does, but I don't know diddly about the code.

Only the drive knows which LBAs exhibit read and write errors. The linux driver doesn't. And the linux driver times out after 30 seconds by default, which is an eternity. There isn't just one request pending; there are dozens to hundreds of pending commands, representing possibly tens of thousands of LBAs in the drive's own cache. Once the drive goes into sector recovery, I'm pretty sure SATA drives, unlike SAS, basically go silent. That's probably when the linux timeout counter starts, and in the meantime md is still talking to the linux driver, making more requests.

That's a huge pile of just requests, never mind the data they represent. Some of those requests made it to the drive, some are still with the linux driver. I think it's much easier and more efficient, when the linux block device driver timeout arrives, for the driver to just nullify all requests, give the drive the boot (lalala, I can't hear you, I don't care if you start talking to me again in 60 seconds), and tell md with a single error that the drive isn't even available. And I'd expect md then does the only thing it can do with such an error, which is the same as for a write error: it flags the device in the array as faulty. I'd be surprised if it tried to reconstruct data at all in such a case, without an explicit read error and LBA reported by the drive.
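(For reference, that 30-second default is the per-device command timeout the kernel exposes in sysfs. A quick sketch, with sdX as a placeholder:

    cat /sys/block/sdX/device/timeout        # current command timeout, defaults to 30 seconds
    echo 120 > /sys/block/sdX/device/timeout # raise it for a drive that may spend minutes in recovery

Needs root, and it resets on reboot, hence the udev/boot script business in the options further down.)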

But I don't know the code, so I'm talking out my ass.


>> 
>>> Any array running with mismatched timeouts will kick a drive on 
>>> every unrecoverable read error, where it would likely have just 
>>> fixed it.
>> 
>> This is the key phrase I was trying to get at.
>> 
>>> Sadly, many hobbyist arrays are built with desktop drives, and the
>>> timeouts are left mismatched.  When that hobbyist later learns
>>> s/he should be scrubbing, the long-overdue scrub is very likely to
>>> produce UREs on multiple drives (BOOM).
>> 
>> Or even if they have been scrubbing all along.
> 
> Yes.  But in that case, they've probably lost arrays without
> understanding why.

Maybe. I don't have data on this. If recovery occurs in less than 30 seconds, they effectively get no indication. They'd have to be looking at ECC errors recorded by SMART. And not all drives record that attribute.
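And by "looking" I mean something like the following (a sketch; the attribute name and numbering vary by vendor, and many drives simply don't report it at all):

    smartctl -A /dev/sdX | grep -i ecc
    # e.g. attribute 195, Hardware_ECC_Recovered, on some drives; the raw
    # value's meaning is vendor-specific, so treat it as a trend, not a count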

>> If the drive recovers the data inside of 30 seconds, and also doesn't
>> relocate the data to a new sector (? I have no idea when drives do
>> this on their own; I know they will do it on a write failure but I'm
>> unclear when they do it on persistent read "difficulty") the scrub
>> has no means of even being aware there's a problem to fix!
> 
> I understand that it varies.  But drives generally only reallocate on a
> write, and only if they are primed to verify the write at that sector by
> a previous URE at that spot.  Those show up in a smartctl report as
> "Pending".

I'm pretty sure that attribute 197, usually called Current_Pending_Sector, only counts unrecoverable read errors, not sectors where ECC detects a transient read error it can correct for. On an uncorrectable error, the firmware doesn't want to relocate the data in that sector because it clearly can't read it correctly, so it just leaves it there until the sector is written; if the write then fails persistently, the sector gets remapped to a reserve sector. I don't know offhand if there is a read-specific remap count, but attribute 5, 'Reallocated_Sector_Ct', appears to count both read and write remaps. *shrug*

I have several HDDs that have attribute 197, but not attribute 5. And an SSD with only attribute 5, not 197.
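For anyone who wants to check their own drives, both attributes are visible with smartctl (sdX is a placeholder):

    smartctl -A /dev/sdX | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector'
    #   5 Reallocated_Sector_Ct   raw value: sectors remapped to spares
    # 197 Current_Pending_Sector  raw value: suspect sectors waiting on a write to be resolved or remapped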


> 
>> Given the craptastic state of affairs that manufacturers disallow a 
>> simple setting change to ask the drive to do LESS error correction, 
>> the recommendation to buy a different drive that can be so 
>> configured, is the best suggestion.
> 
> For a while, it was Hitachi Deskstar or enterprise.  Western Digital's
> new "Red" series appears to be an attempt to deal with the backlash.

Yeah I know. :-) But I mean all drives could have this. It's a request for LESS, not more. I'm not asking for better ECC, although that would be nice, merely a faster time out as a settable option.
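On drives that do support SCTERC, it's one command to query and one to set (a sketch; 70 means 7.0 seconds, and on most drives the setting doesn't survive a power cycle):

    smartctl -l scterc /dev/sdX          # query the current read/write error recovery timers
    smartctl -l scterc,70,70 /dev/sdX    # set both timers to 7.0 seconds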

> 
>> And a very distant alternative 2 is to zero or Secure Erase the drive
>> every so often, in hopes of avoiding bad sectors altogether. That is
>> tedious, and it also implies either putting the drive into a degraded
>> state or cycling in a spare drive.
> 
> No.  This still isn't safe.  UREs can happen at any time, and are spec'd
> to occur at about every 12TB read.  Even on a freshly wiped drive.
> Spares don't help either, as a rebuild onto a spare stresses the rest of
> the array, and is likely to expose any developing UREs.

Technically the spec statistic is "less than" 1 unrecoverable error per 10^14 bits read for a consumer disk. So it's not that you will get a URE at 12TB, but that you should be able to read at least 11.37TiB, on average, without one. It's entirely within the tolerance if the mean occurrence is just shy of 10^15 bits, or about 113TiB. Because the spec says "less than", the quoted value is a ceiling, not a mean. The real mean is likely a lot higher than 12TB, or we'd have total mayhem by now; 12TB is only three full reads of a 4TB drive. It's bad, but not that bad.
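Back of the envelope, for anyone checking the arithmetic:

    echo '10^14 / 8 / 2^40' | bc -l    # ~11.37 TiB readable error-free at the spec floor
    echo '10^15 / 8 / 2^40' | bc -l    # ~113.7 TiB, still "less than 1 error in 10^14 bits"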


> 
>> And at the point you're going to buy a spare drive for this fiasco, 
>> you might as well just buy drives suited for the purpose.
> 
> Options are:
> 
> A) Buy Enterprise drives.  They have appropriate error timeouts and work
> properly with MD right out of the box.
> 
> B) Buy Desktop drives with SCTERC support.  They have inappropriate
> default timeouts, but can be set to an appropriate value.  Udev or boot
> script assistance is needed to call smartctl to set it.  They do *not*
> work properly with MD out of the box.
> 
> C) Suffer with desktop drives without SCTERC support.  They cannot be
> set to appropriate error timeouts.  Udev or boot script assistance is
> needed to set a 120 second driver timeout in sysfs.  They do *not* work
> properly with MD out of the box.
> 
> D) Lose your data during spare rebuild after your first URE.  (Odds in
> proportion to array size.)

That's a good summary.
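For options B and C, the whole boot-time fix is only a few lines of script. A hedged sketch (run as root; I haven't verified smartctl's exit status on every drive without SCT support, so treat the branch test as illustrative, not gospel):

    for d in /dev/sd?; do
        if smartctl -l scterc,70,70 "$d" >/dev/null 2>&1; then
            :   # option B: drive accepted a 7.0 second error recovery limit
        else
            # option C: no SCTERC support, so raise the kernel's command timeout instead
            echo 120 > "/sys/block/${d##*/}/device/timeout"
        fi
    done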


> 
> One last point bears repeating:  MD is *not* a backup system, although
> some people leverage its features for rotating off-site backup disks.
> Raid arrays are all about *uptime*.  They will not save you from
> accidental deletion or other operator errors.  They will not save you if
> your office burns down.  You need a separate backup system for critical
> files.

Yeah, and that's why I'm sorta leery of this RAID 6 setup in the home. I think people are reading that the odds of an array failure with RAID 5 are so high that they add one more drive for dual parity, and *still* don't have a real backup and restore plan. As if RAID 6 were the faux backup plan.

Some home NASes, full of Blu-ray videos, are so big that people either need to stop such behavior or get a used LTO-2 or LTO-3 drive for their gargantuan backups.

Chris--

