Hello,

although I don't know much about this topic, I believe that Ceph erasure
coding will probably solve a lot of these issues, with some speed trade-off.
With erasure coding the redundant data eats far less disk capacity, so you
could afford a higher level of redundancy with a lower disk-usage penalty.

Wolfgang

On 12/19/2013 09:39 AM, Christian Balzer wrote:
>
> Hello,
>
> In my "Sanity check" thread I postulated yesterday that to get the same
> redundancy and resilience against disk failures (excluding other factors)
> as my proposed setup (2 nodes, 2x 11 3TB HDs in RAID6 per node, 2 global
> hot spares, thus 4 OSDs), the "Ceph way" would need something like 6 nodes
> with 10 3TB HDs each and 3-way replication (to protect against dual disk
> failures) to get similar capacity, plus a 7th identical node to allow for
> node failure/maintenance.
>
> That was basically based on me thinking "must not get caught by a dual
> disk failure ever again", as that has happened to me twice: once with a
> RAID5 and the expected consequences, once with a RAID10 where I got lucky
> (8 disks total each time).
>
> However something was nagging me at the back of my brain and it turned
> out to be my long forgotten statistics classes in school. ^o^
>
> So after reading some articles basically all telling the same things, I
> found this: https://www.memset.com/tools/raid-calculator/
>
> Now this is based on assumptions, onto which I will add some more, but
> the last sentence on that page is still quite valid.
>
> So let's compare the 2 configurations above. I assumed 75MB/s recovery
> speed for the RAID6 configuration, something I've seen in practice.
> Basically that's half speed, something that will be lower during busy
> hours and higher during off-peak hours. I made the same assumption for
> Ceph with a 10Gb/s network, assuming 500MB/s recovery/rebalancing speed.
> The rebalancing would have to compete with other replication traffic
> (likely not much of an issue) and with the actual speed/load of the
> individual drives involved. Note that if we assume a totally quiet setup,
> where 100% of all resources would be available for recovery, the numbers
> would of course change, but NOT their ratios.
> I went with the default disk lifetime of 3 years and a 0-day replacement
> time. The latter of course gives very unrealistic results for anything
> without a hot-spare drive, but we're comparing 2 different beasts here.
>
> So that all said, the results of that page that make sense in this
> comparison are the "RAID6 + 1 hotspare" numbers. As in, how likely is a
> 3rd drive failure in the time before recovery is complete? The
> replacement setting of 0 gives us the best possible number, and since one
> would deploy a Ceph cluster with sufficient extra capacity, that's what
> we shall use.
>
> For the RAID6 setup (12 HDs total) this gives us a pretty comfortable
> 1 in 58497.9 ratio of data loss per year.
> Alas, for the 70 HDs in the comparable Ceph configuration we wind up with
> just a 1 in 13094.31 ratio, which while still quite acceptable clearly
> shows where this is going.
>
> So am I completely off my wagon here?
> How do people deal with this when potentially deploying hundreds of disks
> in a single cluster/pool?
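
For what it's worth, below is a very rough back-of-envelope sketch in Python
of that kind of calculation. It is not the memset calculator's model, just a
toy of my own: independent drive failures, a 3-year mean drive lifetime, the
rebuild windows implied by the 75 MB/s and 500 MB/s figures above, and the
pessimistic assumption that any failure beyond what the layout tolerates
means data loss. The function and parameter names are mine and the absolute
numbers will not match the calculator's, but it shows how the odds
deteriorate with disk count (including the 600-disk case below) and how much
a hypothetical erasure-code profile that survives 3 losses (e.g. k=6, m=3)
would buy back.

#!/usr/bin/env python
# Toy failure-probability sketch, NOT the memset calculator's model.
# Assumptions: independent drive failures, 3-year mean drive lifetime,
# data loss whenever more drives fail during one recovery window than
# the layout can tolerate. All parameter values are illustrative.
import math

def binom_tail(n, k_min, p):
    """P(at least k_min successes out of n independent Bernoulli(p) trials)."""
    q = 0.0
    for k in range(k_min):
        c = math.factorial(n) // (math.factorial(k) * math.factorial(n - k))
        q += c * p**k * (1.0 - p)**(n - k)
    return 1.0 - q

def loss_odds_per_year(disks, tolerated, disk_tb, recover_mb_s,
                       mean_life_h=3 * 365 * 24):
    rate = 1.0 / mean_life_h                          # per-disk failure rate [1/h]
    rebuild_h = disk_tb * 1e6 / recover_mb_s / 3600.0 # recovery window [h]
    p = 1.0 - math.exp(-rate * rebuild_h)             # P(a given disk dies during rebuild)
    # expected first failures per year, times the chance that enough further
    # disks fail before that recovery finishes
    per_year = disks * 8760.0 * rate * binom_tail(disks - 1, tolerated, p)
    return 1.0 / per_year                             # "1 in X" chance of loss per year

print("RAID6, 12 disks, 75 MB/s:          1 in %.0f" % loss_odds_per_year(12, 2, 3.0, 75))
print("Ceph 3x, 70 disks, 500 MB/s:       1 in %.0f" % loss_odds_per_year(70, 2, 3.0, 500))
print("Ceph 3x, 600 disks, 500 MB/s:      1 in %.0f" % loss_odds_per_year(600, 2, 3.0, 500))
print("Ceph EC m=3, 600 disks, 500 MB/s:  1 in %.0f" % loss_odds_per_year(600, 3, 3.0, 500))

Run it with any other numbers you like; the point is the scaling of the
ratios, not the absolute values.
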
>
> I mean, when we get to 600 disks (and that's just one rack full; OK,
> maybe 2 due to load and other issues ^o^) of those 4U 60-disk storage
> servers (or 72 disks per 4U if you're happy with killing another drive
> when replacing a faulty one in that Supermicro contraption), that ratio
> is down to 1 in 21.6, which is way worse than the 8-disk RAID5 I
> mentioned up there.
>
> Regards,
>
> Christian

-- 
DI (FH) Wolfgang Hennerbichler
Software Development
Unit Advanced Computing Technologies
RISC Software GmbH
A company of the Johannes Kepler University Linz

IT-Center
Softwarepark 35
4232 Hagenberg
Austria

Phone: +43 7236 3343 245
Fax: +43 7236 3343 250
wolfgang.hennerbichler@xxxxxxxxxxxxxxxx
http://www.risc-software.at

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com