On Thu, Dec 19, 2013 at 12:39 AM, Christian Balzer <chibi@xxxxxxx> wrote:
>
> Hello,
>
> In my "Sanity check" thread I postulated yesterday that to get the same
> redundancy and resilience for disk failures (excluding other factors) as
> my proposed setup (2 nodes, 2x 11-disk 3TB HD RAID6 per node, 2 global
> hotspares, thus 4 OSDs), the "Ceph way" would need something like 6
> nodes with 10 3TB HDs each, with 3-way replication (to protect against
> dual disk failures) to get similar capacity, and a 7th identical node
> to allow for node failure/maintenance.
>
> That was basically based on me thinking "must not get caught by a dual
> disk failure ever again", as that happened twice to me, once with a
> RAID5 and the expected consequences, once with a RAID10 where I got
> lucky (8 disks total each time).
>
> However something was nagging me at the back of my brain, and it turned
> out to be my long forgotten statistics classes in school. ^o^
>
> So after reading some articles basically telling the same things, I
> found this: https://www.memset.com/tools/raid-calculator/
>
> Now this is based on assumptions, onto which I will add some more, but
> the last sentence on that page is still quite valid.
>
> So let's compare the 2 configurations above. I assumed 75MB/s recovery
> speed for the RAID6 configuration, something I've seen in practice.
> Basically that's half speed, something that will be lower during busy
> hours and higher during off-peak hours. I made the same assumption for
> Ceph with a 10Gb/s network, assuming 500MB/s recovery/rebalancing
> speeds. The rebalancing would have to compete with other replication
> traffic (likely not much of an issue) and the actual speed/load of the
> individual drives involved. Note that if we assume a totally quiet
> setup, where 100% of all resources would be available for recovery,
> the numbers would of course change, but NOT their ratios.
> I went with the default disk lifetime of 3 years and 0-day replacement
> time. The latter of course gives very unrealistic results for anything
> w/o a hotspare drive, but we're comparing 2 different beasts here.
>
> So that all said, the results of that page that make sense in this
> comparison are the RAID6 + 1 hotspare numbers. As in, how likely is a
> 3rd drive failure in the time before recovery is complete? The
> replacement setting of 0 gives us the best possible number, and since
> one would deploy a Ceph cluster with sufficient extra capacity, that's
> what we shall use.
>
> For the RAID6 setup (12 HDs total) this gives us a pretty comfortable
> 1 in 58497.9 ratio of data loss per year.
> Alas, for the 70 HDs in the comparable Ceph configuration we wind up
> with just a 1 in 13094.31 ratio, which while still quite acceptable
> clearly shows where this is going.
>
> So am I completely off my wagon here?
> How do people deal with this when potentially deploying hundreds of
> disks in a single cluster/pool?
>
> I mean, when we get to 600 disks (and that's just one rack full; OK,
> maybe 2 due to load and other issues ^o^) of those 4U 60-disk storage
> servers (or 72 disks per 4U if you're happy with killing another drive
> when replacing a faulty one in that Supermicro contraption), that ratio
> is down to 1 in 21.6, which is way worse than that 8-disk RAID5 I
> mentioned up there.
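For reference, here is a rough Python sketch of the kind of model such a
calculator plausibly implements. The memset page does not publish its
formula, so the assumptions below (exponential failures, MTTF equal to
the 3-year disk lifetime, and data loss whenever more disks fail within
one recovery window than the redundancy can absorb) are guesses; the
absolute numbers will not match the page's output, but the ratio moves
the same way.

HOURS_PER_YEAR = 8766
MTTF_H = 3.0 * HOURS_PER_YEAR  # 3-year disk lifetime, per the calculator default

def annual_loss_odds(n_disks, extra_failures, rebuild_h):
    # Odds ("1 in N") of a data-loss event per year: one disk fails,
    # then `extra_failures` more fail before recovery completes.
    # Assumes any further failure hits the already-degraded data,
    # which is pessimistic for Ceph (it depends on PG placement).
    first_failures_per_year = n_disks * HOURS_PER_YEAR / MTTF_H
    p_cascade = 1.0
    for k in range(1, extra_failures + 1):
        p_cascade *= (n_disks - k) * rebuild_h / MTTF_H
    return 1.0 / (first_failures_per_year * p_cascade)

# 11-disk RAID6 array: one 3TB disk rebuilt at 75MB/s, ~11.1 hours;
# survives 2 concurrent failures, dies on a 3rd within the window.
print("RAID6 array : 1 in %.0f" % annual_loss_odds(11, 2, 3e12 / 75e6 / 3600))

# 70-disk 3x replicated pool: 3TB re-replicated at 500MB/s, ~1.7 hours.
print("Ceph 3x pool: 1 in %.0f" % annual_loss_odds(70, 2, 3e12 / 500e6 / 3600))

Under these assumptions the single array comes out roughly an order of
magnitude safer per year than the 70-disk pool, which matches the
direction, if not the exact magnitude, of the memset figures quoted
above: the much shorter Ceph recovery window is outweighed by the much
larger number of disks whose failure can intersect the degraded data.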
I don't know what assumptions that probability calculator is making (and
I think they're overly aggressive about 3x replication; at least, if
you're seeing 1 in 21.6, that doesn't match previous numbers I've seen),
but yes: as you get larger and larger numbers of disks, your
probabilities of failure go way up. This is a thing that people with
large systems deal with. With the tradeoffs that Ceph makes, you get
about the same mean time to failure as a collection of RAID systems of
equivalent size (recovery times are much shorter, but more disks are
involved whose failure can cause data loss), but you lose much less data
in any given incident. As Wolfgang mentioned, erasure-coded pools will
handle this better because they can tolerate much larger failure counts
at a reasonable disk overhead.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
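To put a number on that last point, here is a minimal sketch comparing
how many concurrent disk failures each layout survives against its
raw-storage overhead; the erasure-coding profiles are illustrative
choices, not Ceph defaults.

def overhead(data_chunks, extra_chunks):
    # Extra raw storage consumed per byte of usable data.
    return float(extra_chunks) / data_chunks

layouts = [
    # (name, data chunks k, redundancy chunks m): survives m disk failures
    ("3x replication", 1, 2),
    ("EC k=6, m=3", 6, 3),
    ("EC k=10, m=4", 10, 4),
]

for name, k, m in layouts:
    print("%-15s survives %d failures at %3.0f%% overhead"
          % (name, m, 100 * overhead(k, m)))

A k=10, m=4 profile survives four simultaneous disk failures for 40%
extra raw space, where 3x replication survives only two for 200%, which
is why erasure coding scales the tolerated failure count without
scaling the storage bill.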