>-----Original Message-----
>From: ceph-users-bounces@xxxxxxxxxxxxxx [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Gregory Farnum
>Sent: Thursday, December 19, 2013 7:20 AM
>To: Christian Balzer
>Cc: ceph-users@xxxxxxxxxxxxxx
>Subject: Re: Failure probability with largish deployments
>
>On Thu, Dec 19, 2013 at 12:39 AM, Christian Balzer <chibi@xxxxxxx> wrote:
>>
>> Hello,
>>
>> In my "Sanity check" thread I postulated yesterday that to get the same
>> redundancy and resilience against disk failures (excluding other factors)
>> as my proposed setup (2 nodes, 2x 11 3TB HDs in RAID6 per node, 2 global
>> hot spares, thus 4 OSDs), the "Ceph way" would need something like 6 nodes
>> with 10 3TB HDs each and 3-way replication (to protect against dual disk
>> failures) to get similar capacity, plus a 7th identical node to allow for
>> node failure/maintenance.
>>
>> That was basically based on me thinking "must not get caught by a dual
>> disk failure ever again", as that has happened to me twice: once with a
>> RAID5, with the expected consequences, and once with a RAID10, where I
>> got lucky (8 disks total each time).
>>
>> However, something was nagging at the back of my brain, which turned out
>> to be my long-forgotten statistics classes from school. ^o^
>>
>> So after reading some articles that basically all say the same thing, I
>> found this: https://www.memset.com/tools/raid-calculator/
>>
>> Now this is based on assumptions, onto which I will add some more, but
>> the last sentence on that page is still quite valid.
>>
>> So let's compare the 2 configurations above. I assumed 75MB/s recovery
>> speed for the RAID6 configuration, something I've seen in practice.
>> Basically that's half speed, something that will be lower during busy
>> hours and higher during off-peak hours. I made the same assumption for
>> Ceph with a 10Gb/s network, assuming 500MB/s recovery/rebalancing speeds.
>> The rebalancing would have to compete with other replication traffic
>> (likely not much of an issue) and with the actual speed/load of the
>> individual drives involved. Note that if we assume a totally quiet
>> setup, where 100% of all resources would be available for recovery, the
>> numbers would of course change, but NOT their ratios.
>> I went with the default disk lifetime of 3 years and a 0-day replacement
>> time. The latter of course gives very unrealistic results for anything
>> without a hot-spare drive, but we're comparing 2 different beasts here.
>>
>> All that said, the results from that page that make sense in this
>> comparison are the RAID6 + 1 hot-spare numbers. As in: how likely is a
>> 3rd drive failure in the time before recovery is complete? The
>> replacement setting of 0 gives us the best possible number, and since
>> one would deploy a Ceph cluster with sufficient extra capacity, that's
>> what we shall use.
>>
>> For the RAID6 setup (12 HDs total) this gives us a pretty comfortable
>> 1 in 58497.9 chance of data loss per year.
>> Alas, for the 70 HDs in the comparable Ceph configuration we wind up
>> with just a 1 in 13094.31 chance, which, while still quite acceptable,
>> clearly shows where this is going.
>>
>> So am I completely off my wagon here?
>> How do people deal with this when potentially deploying hundreds of
>> disks in a single cluster/pool?
>>
>> I mean, when we get to 600 disks (and that's just one rack full; OK,
>> maybe 2 due to load and other issues ^o^) of those 4U 60-disk storage
>> servers (or 72 disks per 4U, if you're happy with killing another drive
>> when replacing a faulty one in that Supermicro contraption), that
>> ratio is down to 1 in 21.6, which is way worse than the 8-disk RAID5 I
>> mentioned up there.
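Out of curiosity, the kind of model behind such a calculator can be sketched in a few lines. These are my own simplifying assumptions (independent failures, exponential lifetimes, loss only when two more disks die before recovery of a first failure completes), not the memset page's actual formula, so the absolute numbers won't match its output; the interesting part is the ratio between the two setups:

```python
import math

HOURS_PER_YEAR = 24 * 365.0
MTTF_H = 3 * HOURS_PER_YEAR      # 3-year disk lifetime, the calculator default
DISK_MB = 3e6                    # one 3 TB drive

def recovery_hours(speed_mb_s):
    """Hours to rebuild/re-replicate one 3 TB disk at a sustained speed."""
    return DISK_MB / speed_mb_s / 3600.0

def p_disk_fails_within(hours):
    """Chance a single disk fails within `hours` (exponential lifetime)."""
    return 1.0 - math.exp(-hours / MTTF_H)

def annual_loss_prob(n_disks, speed_mb_s):
    """Rough annual chance that two MORE disks fail before recovery from a
    first failure completes (the triple-failure scenario discussed above)."""
    p1 = p_disk_fails_within(recovery_hours(speed_mb_s))
    # dominant term: any 2 of the remaining disks die inside the window
    p2_more = math.comb(n_disks - 1, 2) * p1 ** 2
    first_failures_per_year = n_disks / 3.0  # each disk fails ~once per 3 years
    return first_failures_per_year * p2_more

raid6 = annual_loss_prob(12, 75)   # 12 disks, rebuilding at 75 MB/s
ceph = annual_loss_prob(70, 500)   # 70 disks, re-replicating at 500 MB/s
print(f"RAID6: 1 in {1 / raid6:,.0f} per year; Ceph-style: 1 in {1 / ceph:,.0f}")
```

Even this naive model reproduces the qualitative result: the larger pool recovers an individual disk much faster, but with far more disks participating, its combined annual loss probability under a RAID-style loss criterion comes out several times higher.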
>
>I don't know what assumptions that probability calculator is making (and I
>think they're overly aggressive about the 3x replication; at least if you're
>seeing 1 in 21.6, that doesn't match previous numbers I've seen), but yes: as
>you get larger and larger numbers of disks, your probability of failure goes
>way up. This is a thing that people with large systems deal with. With the
>tradeoffs that Ceph makes, you get about the same mean time to failure as a
>collection of RAID systems of equivalent size (recovery times are much
>shorter, but more disks are involved whose failure can cause data loss), but
>you lose much less data in any given incident.
>As Wolfgang mentioned, erasure-coded pools will handle this better because
>they can provide much larger failure counts at a reasonable disk overhead.
>-Greg

It seems like this calculation ignores that in a large Ceph cluster with
triple replication, three drive failures don't automatically guarantee data
loss (unlike in a RAID6 array). If your data is triple-replicated and the
copies of a given piece of data live on three separate disks in the cluster,
and you have three disks fail, the odds that those are the only three disks
holding copies of that data should be pretty low for a very large number of
disks.

For the 600-disk cluster, after the first disk fails you'd have a 2 in 599
chance of losing the second copy when the second disk fails, then a 1 in 598
chance of losing the third copy when the third disk fails. So even assuming
a triple disk failure has already happened, don't you still have something
like a 99.9994% chance that you didn't lose all copies of your data? And
then, if there's only a 1 in 21.6 chance of a triple disk failure outpacing
recovery in the first place, that gets you to something like 99.99997%
reliability?
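That arithmetic can be checked directly. Note this sketch treats "your data" as a single triple-replicated object; with many placement groups, the chance that the three failed disks hold all copies of *some* object in the cluster is correspondingly higher than the per-object figure:

```python
# In a 600-disk cluster, chance a specific triple-replicated object lived on
# exactly the three disks that failed: the second failed disk must hold one
# of the 2 remaining copies, and the third the last one.
p_unlucky = (2 / 599) * (1 / 598)

survival_given_3_failures = 1 - p_unlucky   # ~99.9994%

# Fold in the ~1 in 21.6 annual chance (from the calculator discussion above)
# that a triple failure outpaces recovery at all:
annual_reliability = 1 - p_unlucky / 21.6   # ~99.99997%

print(f"{survival_given_3_failures:.6%}  {annual_reliability:.6%}")
```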