On 12/19/2013 09:39 AM, Christian Balzer wrote:
Hello,

In my "Sanity check" thread I postulated yesterday that to get the same redundancy and resilience for disk failures (excluding other factors) as my proposed setup (2 nodes, 2x 11 3TB HDs in RAID6 per node, 2 global hotspares, thus 4 OSDs), the "Ceph way" would need something like 6 nodes with 10 3TB HDs each and 3-way replication (to protect against dual disk failures) to get similar capacity, plus a 7th identical node to allow for node failure/maintenance.

That was basically based on me thinking "must not get caught by a dual disk failure ever again", as that has happened to me twice: once with a RAID5, with the expected consequences, and once with a RAID10, where I got lucky (8 disks total each time).

However something was nagging me at the back of my brain, and it turned out to be my long forgotten statistics classes in school. ^o^ So after reading some articles basically telling the same things, I found this: https://www.memset.com/tools/raid-calculator/

Now this is based on assumptions, onto which I will add some more, but the last sentence on that page is still quite valid. So let's compare the 2 configurations above. I assumed 75MB/s recovery speed for the RAID6 configuration, something I've seen in practice. Basically that's half speed, something that will be lower during busy hours and higher during off-peak hours. I made the same assumption for Ceph with a 10Gb/s network, assuming 500MB/s recovery/rebalancing speed. The rebalancing would have to compete with other replication traffic (likely not much of an issue) and with the actual speed/load of the individual drives involved. Note that if we assume a totally quiet setup, where 100% of all resources would be available for recovery, the numbers would of course change, but NOT their ratios.

I went with the default disk lifetime of 3 years and a 0-day replacement time. The latter of course gives very unrealistic results for anything w/o a hotspare drive, but we're comparing 2 different beasts here.
So, all that said, the results on that page that make sense for this comparison are the RAID6 + 1 hotspare numbers. As in: how likely is a 3rd drive failure in the time before recovery is complete? The replacement setting of 0 gives us the best possible number, and since one would deploy a Ceph cluster with sufficient extra capacity anyway, that's what we shall use.

For the RAID6 setup (12 HDs total) this gives us a pretty comfortable 1 in 58497.9 chance of data loss per year. Alas, for the 70 HDs in the comparable Ceph configuration we wind up with just a 1 in 13094.31 chance, which while still quite acceptable clearly shows where this is going.

So am I completely off my wagon here? How do people deal with this when potentially deploying hundreds of disks in a single cluster/pool?
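The calculator's exact model isn't something I can reproduce here, but the scaling effect can be sketched with a crude exponential/binomial model of my own (the 3-year lifetime and the recovery windows are the assumptions from above; this gives per-incident probabilities, not the calculator's annualized figures):

```python
import math

# Crude sketch (my own model, NOT the memset calculator's): after one drive
# fails, what's the chance that enough *additional* drives fail during the
# recovery window to lose data?  Per-disk failures are exponential; failures
# across disks are treated as independent (binomial).

LIFETIME_H = 3 * 365 * 24  # assumed mean disk lifetime: 3 years, in hours

def p_disk_fails(window_h, lifetime_h=LIFETIME_H):
    """Probability that one disk fails within the window (exponential model)."""
    return 1.0 - math.exp(-window_h / lifetime_h)

def p_data_loss(n_other_disks, failures_to_lose, window_h):
    """P(at least `failures_to_lose` of the remaining disks fail in the window)."""
    p = p_disk_fails(window_h)
    survive = sum(math.comb(n_other_disks, k) * p**k * (1 - p)**(n_other_disks - k)
                  for k in range(failures_to_lose))
    return 1.0 - survive

# RAID6: 12 disks, 1 already down, 2 more failures lose data, ~11 h rebuild.
raid6 = p_data_loss(11, 2, 11.1)
# 3x Ceph: 70 disks, 1 down, worst case 2 more failures hitting overlapping
# PGs lose data, ~1.7 h recovery.  (Pessimistic: not every pair of extra
# failures actually shares a PG with the first.)
ceph = p_data_loss(69, 2, 1.7)
print(raid6, ceph)
```

At a fixed recovery window the risk grows roughly quadratically with the disk count (compare `p_data_loss(11, 2, 11.1)` with `p_data_loss(69, 2, 11.1)`); the faster Ceph recovery claws a lot of that back, which is why the calculator's ratios end up closer than the raw disk counts would suggest.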
I'd suggest using disks from different vendors, which in such a setup probably means mixing Seagate and Western Digital.
That way you rule out bad-batch issues, and the likelihood of identical disks failing at the same time becomes smaller as well.
Also, make sure that you define your crushmap so that replicas never end up on the same physical host and, if possible, not in the same cabinet/rack.
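For reference, a crushmap rule along those lines might look like this (a sketch only; the `default` root, the `rack` bucket type and the ruleset number are assumptions about your hierarchy, and you'd adjust them to match your actual crushmap):

```
rule replicated_rack {
        ruleset 1
        type replicated
        min_size 2
        max_size 3
        step take default
        # pick each replica from a different rack; use "type host"
        # instead if you only have one rack
        step chooseleaf firstn 0 type rack
        step emit
}
```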
I would never run 60 drives in a single machine in a Ceph cluster; I'd suggest you use more machines with fewer disks per machine.
I mean, when we get to 600 disks (and that's just one rack full, OK, maybe 2 due to load and other issues ^o^) of those 4U 60-disk storage servers (or 72 disks per 4U if you're happy with killing another drive when replacing a faulty one in that Supermicro contraption), that ratio is down to 1 in 21.6, which is way worse than the 8-disk RAID5 I mentioned up there.

Regards,
Christian
--
Wido den Hollander
42on B.V.
Phone: +31 (0)20 700 9902
Skype: contact42on