Re: high throughput storage server?

On 25/02/2011 00:30, Stan Hoeppner wrote:
David Brown put forth on 2/24/2011 5:24 AM:

My understanding of RAID controllers (software or hardware) is that they
consider a drive to be either "good" or "bad".  So if you get an URE,
the controller considers the drive "bad" and ejects it from the array.
It doesn't matter if it is an URE or a total disk death.

Maybe hardware RAID controllers do something else here - you know far
more about them than I do.

Most HBA and SAN RAID firmware I've dealt with kicks drives offline
pretty quickly at any sign of an unrecoverable error.  I've also seen
drives kicked simply because the RAID firmware didn't like the drive
firmware.  I have a fond (sarcasm) memory of DAC960s kicking ST118202
18GB Cheetahs offline left and right in the late 90s.  The fact I still
recall that Seagate drive# after 10+ years should be informative
regarding the severity of that issue.  :(

The idea of the md raid "bad block list" is that there is a medium
ground - you can have disks that are "mostly good".

Everything I've read and seen in the last few years regarding hard disk
technology says that platter manufacturing quality and tolerance are so
high on modern drives that media defects are rarely, if ever, seen by
the customer, as they're mapped out at the factory.  The platters don't
suffer wear effects, but the rest of the moving parts do.  From what
I've read/seen, "media" errors observed in the wild today are actually
caused by mechanical failures due to physical wear on various moving
parts:  VC actuator pivot bearing/race, spindle bearings, etc.
Mechanical failures tend to show mild "media errors" in the beginning
and get worse with time as moving parts go further out of alignment.
Thus, as I see it, any UREs on a modern drive represent a "Don't trust
me--Replace me NOW" flag.  I could be all wrong here, but this is what
I've read, and seen in manufacturer videos from WD and Seagate.


That's very useful information to know - I don't go through nearly enough disks myself to be able to judge these things (and while I read lots of stuff on the web, I don't see /everything/!). Thanks.

However, this still sounds to me like a drive with UREs is dying but not dead yet. Assuming you are correct here (and I've no reason to doubt that - unless someone else disagrees), it means that a disk with UREs will be dying quickly rather than dying slowly. But if the non-URE data on the disk can be used to make a rebuild faster and safer, then surely that is worth doing?

It may be that when a disk has had an URE, and therefore has an entry in the bad block list, it should be marked read-only and only used for data recovery and "hot replace" rebuilds. But until it completely croaks, it is still better than no disk at all while the rebuild is in progress.
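
To make that concrete, here is a tiny Python sketch of the sort of bookkeeping I have in mind - a per-disk bad block set plus a read path that falls back to the other mirror leg instead of ejecting the drive. This is only my mental model; it has nothing to do with md's actual bad block log format or code.

# Toy model: a disk with entries in its bad block list is "mostly
# good" - reads of listed sectors fail, everything else still works.

class Disk:
    def __init__(self, name, data):
        self.name = name
        self.data = data            # sector -> contents
        self.bad_blocks = set()     # sectors known to be unreadable

    def read(self, sector):
        if sector in self.bad_blocks:
            raise IOError("%s: URE at sector %d" % (self.name, sector))
        return self.data[sector]

def mirror_read(primary, secondary, sector):
    # RAID1-style read: on an URE, record the bad block and use the
    # other leg, rather than kicking the whole disk out of the array.
    try:
        return primary.read(sector)
    except IOError:
        primary.bad_blocks.add(sector)
        return secondary.read(sector)

A disk whose bad block set is non-empty is exactly the "mostly good" case - still perfectly usable for recovery reads like this, but due for replacement.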


Supposing you have a RAID6 array, and one disk has died completely.  It
gets replaced by a hot spare, and rebuild begins.  As the rebuild
progresses, disk 1 gets an URE.  Traditional handling would mean disk 1
is ejected, and now you have a double-degraded RAID6 to rebuild.  When
you later get an URE on disk 2, you have lost data for that stripe - and
the whole raid is gone.

But with bad block lists, the URE on disk 1 leads to a bad block entry
on disk 1, and the rebuild continues.  When you later get an URE on disk
2, it's no problem - you use data from disk 1 and the other disks. UREs
are no longer a killer unless your set has no redundancy.
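
Just to spell that out: per stripe, the only rule that matters for RAID6 is "at most two members missing". A throwaway illustration in Python (the policy comparison is mine, nothing to do with md's internals):

PARITY = 2   # RAID6 can reconstruct a stripe with up to 2 members missing

def stripe_recoverable(missing_members):
    # "missing" = dead disks plus disks with an URE *in this stripe*
    return len(missing_members) <= PARITY

# disk0 has died; during the rebuild disk1 gets an URE in stripe 100
# and disk2 gets one in stripe 200.

# Traditional policy: disk1 is ejected at its first URE, so from then
# on every stripe is missing disk0 and disk1, and the later URE on
# disk2 makes three missing members in stripe 200 - data lost.
print(stripe_recoverable({"disk0", "disk1", "disk2"}))    # False

# Bad block list policy: disk1 only "loses" stripe 100 and disk2 only
# stripe 200, so no stripe ever has more than two members missing.
print(stripe_recoverable({"disk0", "disk1"}))             # True
print(stripe_recoverable({"disk0", "disk2"}))             # True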

They're not a killer with RAID 6 anyway, are they?  You can be
rebuilding one failed drive and suffer UREs left and right, as long as
you don't get two of them on two drives simultaneously in the same
stripe block read.  I think that's right.  Please correct me if not.


That's true as long as UREs do not cause that disk to be kicked out of the array. With bad block support in md raid, a disk suffering an URE will /not/ be kicked out. But my understanding (from what you wrote above) was that with hardware raid controllers, an URE /would/ cause a disk to be kicked out. Or am I mixing something up again?
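
For what it's worth, the chance of the "two UREs on two drives in the same stripe read" case during a single rebuild looks tiny on a back-of-the-envelope estimate. All the numbers below are assumptions picked for illustration (a 10^-14 per-bit URE rate, 2TB drives, 512KiB chunks, an 8-drive RAID6):

# Rough estimate of losing a stripe to a double URE while rebuilding
# a RAID6 that has already lost one whole disk.  Illustrative numbers.

URE_PER_BIT = 1e-14            # spec-sheet style URE rate
DISK_BYTES  = 2e12             # 2TB drives
CHUNK       = 512 * 1024       # bytes read per surviving disk per stripe
SURVIVORS   = 7                # members left in an 8-drive RAID6

p_chunk = URE_PER_BIT * CHUNK * 8      # P(URE while reading one chunk)
stripes = DISK_BYTES / CHUNK           # stripe reads per disk
pairs   = SURVIVORS * (SURVIVORS - 1) / 2.0

# P(two survivors both hit an URE in the same stripe), summed over all
# stripes - a small-probability approximation.
print("~%.1e chance of a fatal double URE per rebuild"
      % (stripes * pairs * p_chunk ** 2))

That comes out at around 1e-7 per rebuild, so on its own an URE during a RAID6 rebuild really isn't the killer - it's the ejection policy that turns it into one.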

UREs are also what I worry about with RAID1 (including RAID10)
rebuilds.  If a disk has failed, you are right in saying that the
chances of the second disk in the pair failing completely are tiny.  But
the chances of getting an URE on the second disk during the rebuild are
not negligible - they are small, but growing with each new jump in disk
size.

I touched on this in my other reply, somewhat tongue-in-cheek mentioning
3 leg and 4 leg RAID10.  At current capacities and URE ratings I'm not
worried about it with mirror pairs.  If URE ratings haven't increased
substantially by the time our avg drive capacity hits 10TB I'll start to
worry.
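
To put rough numbers on "small, but growing": the usual back-of-the-envelope for the chance of at least one URE while reading a whole surviving mirror leg, taking the spec-sheet per-bit URE ratings at face value (which is itself a sizeable assumption):

import math

# P(at least one URE while reading an entire surviving mirror leg),
# assuming independent errors at the quoted per-bit rate.

def p_ure_during_rebuild(capacity_bytes, ure_per_bit):
    expected = capacity_bytes * 8 * ure_per_bit
    return -math.expm1(-expected)        # 1 - e^-lambda

for tb in (1, 2, 10):
    cap = tb * 1e12
    print("%2d TB: ~%4.1f%% at 10^-14, ~%4.1f%% at 10^-15"
          % (tb, 100 * p_ure_during_rebuild(cap, 1e-14),
                 100 * p_ure_during_rebuild(cap, 1e-15)))

That gives roughly 8%/15%/55% at the consumer 10^-14 rating for 1/2/10TB, versus about 1%/2%/8% at 10^-15 - so the 10^-15 class drives stay comfortable for a good while yet, while the consumer figures are already getting uncomfortable at today's sizes.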

Somewhat related to this, does anyone else here build their arrays from
the smallest cap drives they can get away with, preferably single
platter models when possible?  I adopted this strategy quite some time
ago, mostly to keep rebuild times to a minimum and to keep rotational
mass low so as to consume the least energy (since I'm using more
drives), but also with the URE
issue in the back of my mind.  Anecdotal evidence tends to point to the
trend of OPs going with fewer gargantuan drives instead of many smaller
ones.  Maybe that's just members of this list, whose criteria may be
quite different from the typical enterprise data center.
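
On the rebuild-time half of that argument, the arithmetic is simple enough: a mirror or hot-replace rebuild is essentially a sequential pass over the drive, so the time scales directly with capacity. The throughput figures below are just illustrative guesses, not measurements:

# Best-case rebuild time for a mirror / hot-replace style rebuild,
# limited by sustained sequential throughput.

def rebuild_hours(capacity_tb, mb_per_sec):
    return capacity_tb * 1e12 / (mb_per_sec * 1e6) / 3600.0

for cap_tb, rate in ((0.5, 120), (2.0, 130), (3.0, 140)):
    print("%.1f TB at %d MB/s: ~%.1f hours"
          % (cap_tb, rate, rebuild_hours(cap_tb, rate)))

Small single-platter drives keep that exposure window down to an hour or so, even though the bigger drives' slightly higher sustained rates claw a little of the difference back.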

With md raid's future bad block lists and hot replace features, an
URE on the second disk during rebuilds is only a problem if the first
disk has died completely - if it only had a small problem, then the "hot
replace" rebuild will be able to use both disks to find the data.

What happens when you have multiple drives at the same or similar bad
block count?


You replace them all. Once a drive reaches a certain number of bad blocks (and that threshold may be just 1, or it may be more), you should replace it. There isn't any reason not to do hot replace rebuilds on multiple drives simultaneously, if you've got the drives and drive bays on hand - apart from at the bad blocks, the replacement is just a straight disk-to-disk copy.
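
Something like the following is the kind of "hot replace" pass I have in mind - a straight copy, dropping back to reconstruction from the rest of the array only at the source's bad blocks. This is purely my sketch of the idea, not md's implementation:

# source/replacement modelled as dicts of sector -> contents;
# bad_blocks is the set of sectors the outgoing disk can't read;
# reconstruct(sector) rebuilds a sector from the remaining members.

def hot_replace(source, bad_blocks, replacement, reconstruct):
    for sector in source:
        if sector in bad_blocks:
            replacement[sector] = reconstruct(sector)   # slow path
        else:
            replacement[sector] = source[sector]        # plain copy

# Each pass only reads its own outgoing disk (plus the odd
# reconstruction), so nothing stops running several of them at once
# when the spare drives and bays are available.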

I know you are more interested in hardware raid than software raid, but
I'm sure you'll find some interesting points in Neil's writings.  If you
don't want to read through the thread, at least read his blog post.

<http://neil.brown.name/blog/20110216044002>

Will catch up.  Thanks for the blog link.




