Re: Sunfire X4500 recommendations

david@xxxxxxx · Thu, 29 Mar 2007 18:41:20 -0700 (PDT)

On Thu, 29 Mar 2007, Matt Smiley wrote:

Hi David,

Thanks for your feedback!  I'm rather a newbie at this, and I do appreciate the critique.

First, let me correct myself: The formulas for the risk of losing data when you lose 2 and 3 disks shouldn't have included the first term (g/n).  I'll give the corrected formulas and tables at the end of the email.

please explain why you are saying that the risk of losing any 1 disk is
1/n. shouldn't it be probability of failure * n instead?

1/n represents the assumption that all disks have an equal probability of being the next one to fail.  This seems like a fair assumption in general for the active members of a stripe (not including hot spares).  A possible exception would be the parity disks (because reads always skip them and writes always hit them), but that's only a consideration if the RAID configuration used dedicated disks for parity instead of distributing it across the RAID 5/6 group members.  Apart from that, whether the workload is write-heavy or read-heavy, sequential or scattered, the disks in the stripe ought to handle a roughly equivalent number of iops over their lifetime.

only assuming that you have a 100% chance of some disk failing. if you 
have 15 disks in one array and 60 disks in another array the chances of 
having _some_ failure in the 15 disk array is only 1/4 the chance of 
having a failure of _some_ disk in the 60 disk array
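
A minimal sketch of the 15-disk vs 60-disk comparison, assuming independent failures and a hypothetical per-disk failure probability `p` (neither the value nor the function name comes from this thread):

```python
def chance_of_some_failure(num_disks, p_disk):
    """Probability that at least one of num_disks fails.

    Assumes each disk fails independently with probability p_disk
    over the same time window (a simplifying assumption).
    """
    return 1.0 - (1.0 - p_disk) ** num_disks


p = 0.001  # hypothetical per-disk failure probability, for illustration only
small = chance_of_some_failure(15, p)
large = chance_of_some_failure(60, p)
ratio = small / large

print(f"15 disks: {small:.4%}  60 disks: {large:.4%}  ratio: {ratio:.3f}")
```

For small `p` the ratio comes out close to 1/4, which is the point being made: the chance of _some_ failure grows with the number of disks, it is not fixed at 100%.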

following this logic the risk of losing all 48 disks in a single group of
48 would be 100%

Exactly.  Putting all disks in one group is RAID 0 -- no data protection.  If you lose even 1 active member of the stripe, the probability of losing your data is 100%.

but by your math, the chance of failure with dual parity on a 48 disk 
array was also 100%, and that is just wrong.

also what you are looking for is the probability of the second (and third)
disks failing in time X (where X is the time necessary to notice the
failure, get a replacement, and rebuild the disk)

Yep, that's exactly what I'm looking for.  That's why I said, "these 
probabilities are only describing the case where we don't have enough 
time between disk failures to recover the array."  My goal wasn't to 
estimate how long time X is.  (It doesn't seem like a generalizable 
quantity; due partly to logistical and human factors, it's unique to 
each operating environment.)  Instead, I start with the assumption that 
time X has been exceeded, and we've lost a 2nd (or 3rd) disk in the 
array.  Given that assumption, I wanted to show the probability that the 
loss of the 2nd disk has caused the stripe to become unrecoverable.

Ok, so this is the chance of losing data in different array layouts if 
you lose N disks without replacing any of them.

We know that RAID 10 and 50 can tolerate the loss of anywhere between 1 
and n/g disks, depending on how lucky you are.  I wanted to quantify the 
amount of luck required, as a risk management tool.  The duration of 
time X can be minimized with hot spares and attentive administrators, 
but the risk after exceeding time X can only be minimized (as far as I 
know) by configuring the RAID stripe with small enough underlying 
failure groups.

but I don't think this is the question anyone is really asking.

what people want to know isn't 'how many disks can I lose without 
replacing them before I lose data', what they want to know is 'with this 
configuration (including a drive replacement time of Y for the first N 
drives and Z for drives after that), what are the odds of losing data'

and for the second question the chance of failure of additional disks 
isn't 100%.

the killer is the time needed to rebuild the disk, with multi-TB arrays
it's sometimes faster to re-initialize the array and reload from backup
than it is to do a live rebuild (the kernel.org servers had a raid failure
recently and HPA mentioned that it took a week to rebuild the array, but
it would have only taken a couple days to do a restore from backup)

That's very interesting.  I guess the rebuild time also would depend on 
how large the damaged failure group was.  Under RAID 10, for example, I 
think you'd still only have to rebuild 1 disk from its mirror, 
regardless of how many other disks were in the stripe, right?  So 
shortening the rebuild time may be another good motivation to keep the 
failure groups small.

correct, however you have to decide how much this speed is worth to you. 
if you are building a ~20TB array you can do this with ~30 drives with 
single or dual parity, or ~60 drives with RAID 10.
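
A rough sketch of where the ~30 vs ~60 drive counts come from. It assumes 750G drives (the size used later in this mail) and a single parity group; `drives_needed` is a hypothetical helper, and real layouts with several groups or hot spares would need a few more drives.

```python
import math


def drives_needed(usable_gb, disk_gb, layout, parity_disks=0):
    """Drives required to reach usable_gb of capacity.

    layout "mirror": every data disk is paired with a mirror (RAID 10).
    layout "parity": data disks plus parity_disks for one parity group.
    """
    data_disks = math.ceil(usable_gb / disk_gb)
    if layout == "mirror":
        return 2 * data_disks
    return data_disks + parity_disks


target_gb = 20 * 1000  # ~20TB
disk_gb = 750          # assumed drive size

single_parity = drives_needed(target_gb, disk_gb, "parity", parity_disks=1)
dual_parity = drives_needed(target_gb, disk_gb, "parity", parity_disks=2)
mirrored = drives_needed(target_gb, disk_gb, "mirror")

print(single_parity, dual_parity, mirrored)
```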

remember the big cost of arrays like this isn't even the cost of the 
drives (although you are talking an extra $20,000 or so there), but the 
cost of the power and cooling to run all those extra drives

add to this the fact that disk failures do not appear to be truly
independent from each other statistically (see the recent studies released
by google and cmu), and I wouldn't bother with single-parity for a

I don't think I've seen the studies you mentioned.  Would you cite them 
please?

http://labs.google.com/papers/disk_failures.pdf

http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html

This may not be typical of everyone's experience, but what I've 
seen during in-house load tests is an equal I/O rate for each disk in my 
stripe, using short-duration sampling intervals to avoid long-term 
averaging effects.  This is what I expected to find, so I didn't delve 
deeper.

Certainly it's true that some disks may be more heavily burdened than 
others for hours or days, but I wouldn't expect any bias from an 
application-driven access pattern to persist for a significant fraction 
of a disk's lifespan.  The only influence I'd expect to bias the 
cumulative I/O handled by a disk over its entire life would be its role 
in the RAID configuration.  Hot spares will have minimal wear-and-tear 
until they're activated.  Dedicated parity disks will probably live 
longer than data disks, unless the workload is very heavily oriented 
towards small writes (e.g. logging).

multi-TB array. If the data is easy to recreate (including from backup) or
short lived (say a database of log data that cycles every month or so) I
would just do RAID-0 and plan on losing the data on drive failure (this
assumes that you can afford the loss of service when this happens). if the
data is more important then I'd do dual-parity or more, along with a hot
spare so that the rebuild can start as soon as the first failure is
noticed by the system to give myself a fighting chance to save things.

That sounds like a fine plan.  In my case, downtime is unacceptable 
(which is, of course, why I'm interested in quantifying the 
probabilities of data loss).

Here are the corrected formulas:

Let:
  g = number of disks in each group (e.g. mirroring = 2; single-parity = 3 or more; dual-parity = 4 or more)
  n = total number of disks
  risk of losing any 1 disk = 1/n
Then we have:
  risk of losing 1 disk from a particular group = g/n

assuming you lose one disk

  risk of losing 2 disks in the same group = (g-1)/(n-1)

assuming that you lose two disks without replacing either one (including 
not having a hot-spare)

  risk of losing 3 disks in the same group = (g-1)/(n-1) * (g-2)/(n-2)

assuming that you lose three disks without replacing any of them 
(including not having a hot spare)
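
The corrected formulas can be sketched in a few lines. This is only an illustration: `risk_same_group` is a made-up name, and it carries the same assumption as the formulas, namely that the failed disks are never replaced and every surviving disk is equally likely to fail next.

```python
from fractions import Fraction


def risk_same_group(g, n, failures):
    """Chance that `failures` dead disks all land in the same group.

    g = disks per group, n = total disks. The first failure can be
    anywhere; each later failure must hit one of the remaining disks
    of that same group: (g-1)/(n-1) * (g-2)/(n-2) * ...
    """
    risk = Fraction(1)
    for k in range(1, failures):
        risk *= Fraction(g - k, n - k)
    return risk


n = 48
for g in (2, 3, 4, 6, 8, 12, 16, 24, 48):
    two = float(risk_same_group(g, n, 2))
    three = float(risk_same_group(g, n, 3))
    print(f"{g:2d} disks/group: 2 in same group {two:7.2%}, 3 in same group {three:7.2%}")
```

With g = 2 the three-disk case comes out as 0, since a two-disk group cannot lose three members.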

For the x4500, we have 48 disks.  If we stripe our data across all those 
disks, then these are our configuration options:

RAID 10 or 50 -- Mirroring or single-parity must lose 2 disks from the same group to lose data:
disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
            2          24           48            24              2.13%
            3          16           48            32              4.26%
            4          12           48            36              6.38%
            6           8           48            40             10.64%
            8           6           48            42             14.89%
           12           4           48            44             23.40%
           16           3           48            45             31.91%
           24           2           48            46             48.94%
           48           1           48            47            100.00%

however, back in the real world, the chance of losing three disks is 
considerably less than the chance of losing two disks. so to compare 
apples to apples you need to add the following

chance of data loss if using double-parity: 0% in all configurations.

RAID 60 or Z2 -- Double-parity must lose 3 disks from the same group to lose data:
disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
            2          24           48           n/a                n/a
            3          16           48            16              0.09%
            4          12           48            24              0.28%
            6           8           48            32              0.93%
            8           6           48            36              1.94%
           12           4           48            40              5.09%
           16           3           48            42              9.71%
           24           2           48            44             23.40%
           48           1           48            46            100.00%

again, to compare apples to apples you would need to add the following 
(calculating the odds for each group, they will be scarily larger than 
the 2-drive failure chart)

RAID 10 or 50 -- Mirroring or single-parity must lose 2 disks from the same group to lose data:
disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
            2          24           48            24
            3          16           48            32
            4          12           48            36
            6           8           48            40
            8           6           48            42
           12           4           48            44
           16           3           48            45
           24           2           48            46
           48           1           48            47

however, since it's easy to add a hot-spare drive, you really need to 
account for it. there's still a chance of all the drives going bad before 
the hot-spare can be built to replace the first one, but it's a lot lower 
than if you don't have a hot-spare and require the admins to notice and 
replace the failed disk.

if you say that there is a 10% chance of a disk failing each year 
(significantly higher than the studies listed above, but close enough) 
then this works out to ~0.001% chance of a drive failing per hour (a 
reasonably round number to work with)

to write 750G at ~45MB/sec takes 5 hours of 100% system throughput, or ~50 
hours at 10% of the system throughput (background rebuilding)
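
The two round numbers above can be checked with a little arithmetic. A sketch only: it spreads the 10% annual figure evenly over the hours of a year and treats 750G as 750,000 MB, both simplifications of mine.

```python
HOURS_PER_YEAR = 365 * 24

# Spread the assumed 10% annual failure chance evenly over the year.
annual_failure = 0.10
hourly_failure = annual_failure / HOURS_PER_YEAR

# Time to write one 750G disk at ~45MB/sec.
disk_mb = 750 * 1000
write_mb_per_sec = 45
full_speed_hours = disk_mb / write_mb_per_sec / 3600

# Background rebuild that only gets 10% of the system throughput.
background_hours = full_speed_hours / 0.10

print(f"failure chance per hour: {hourly_failure:.5%}")
print(f"rebuild at full speed:   {full_speed_hours:.1f} hours")
print(f"rebuild in background:   {background_hours:.1f} hours")
```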

if we cut this in half to account for inefficiencies in retrieving data 
from other disks to calculate parity it can take 100 hours (just over 
four days) to do a background rebuild, or about 0.1% chance for each disk 
of losing a second disk. with 48 drives this is ~5% chance of losing 
everything with single-parity, however the odds of losing two disks 
during this time are .25% so double-parity is _well_ worth it.
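
A back-of-the-envelope version of that estimate. The per-hour rate and the 100 hour window are the figures from this mail; squaring the single-failure figure to get the two-failure figure is only one reading of where the ~.25% comes from, so treat that step as an assumption.

```python
hourly_failure = 0.00001   # ~0.001% per disk per hour (rounded figure)
rebuild_hours = 100        # background rebuild window
disks = 48

# Chance that one given disk fails during the rebuild window.
per_disk = hourly_failure * rebuild_hours

# Rough chance that some disk in the array fails during the window
# (simple multiplication, fine while the numbers stay small).
one_more_failure = disks * per_disk

# Assumed reading: two further failures ~ the single-failure chance squared.
two_more_failures = one_more_failure ** 2

print(f"per disk:          {per_disk:.2%}")
print(f"one more failure:  {one_more_failure:.2%}")
print(f"two more failures: {two_more_failures:.2%}")
```

These are rough upper-end estimates, in line with the rounding used in the text; a proper binomial calculation would give somewhat smaller numbers.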

chance of losing data before hotspare is finished rebuilding (assumes one 
hotspare per group, you may be able to share a hotspare between multiple 
groups to get slightly higher capacity)

RAID 60 or Z2 -- Double-parity must lose 3 disks from the same group to lose data:
disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
            2          24           48           n/a                n/a
            3          16           48           n/a         (0.0001% with manual replacement of drive)
            4          12           48            12         0.0009%
            6           8           48            24         0.003%
            8           6           48            30         0.006%
           12           4           48            36         0.02%
           16           3           48            39         0.03%
           24           2           48            42         0.06%
           48           1           48            45         0.25%

RAID 10 or 50 -- Mirroring or single-parity must lose 2 disks from the same group to lose data:
disks_per_group  num_groups  total_disks  usable_disks  risk_of_data_loss
            2          24           48            n/a        (~0.1% with manual replacement of drive)
            3          16           48            16         0.2%
            4          12           48            24         0.3%
            6           8           48            32         0.5%
            8           6           48            36         0.8%
           12           4           48            40         1.3%
           16           3           48            42         1.7%
           24           2           48            44         2.5%
           48           1           48            46         5%

so if I've done the math correctly the odds of losing data with the 
worst-case double-parity (one large array including hotspare) are about 
the same as the best case single parity (mirror+ hotspare), but with 
almost triple the capacity.

David Lang