Hello,

On Thu, 19 Dec 2013 15:43:16 +0000 Gruher, Joseph R wrote:

[snip]
> It seems like this calculation ignores that in a large Ceph cluster with
> triple replication, having three drive failures doesn't automatically
> guarantee data loss (unlike a RAID6 array)? If your data is triple
> replicated and a copy of a given piece of data exists on three separate
> disks in the cluster, and you have three disks fail, the odds of those
> being the only three disks with copies of that data should be pretty low
> for a very large number of disks. For the 600-disk cluster, after the
> first disk fails you'd have a 2 in 599 chance of losing the second copy
> when the second disk fails, then a 1 in 598 chance of losing the third
> copy when the third disk fails, so even assuming a triple disk failure
> has already happened, don't you still have something like a 99.9994%
> chance that you didn't lose all copies of your data? And then if there's
> only a 1 in 21.6 chance of having a triple disk failure outpace recovery
> in the first place, that gets you to something like 99.99997% reliability?
>
I think putting that number into perspective with a real event unfolding
just now, in a data center that's not local and where no monkeys are
available, might help.

A 24-disk server, RAID6, one hot spare. 4 years old now, the crappy
Seagates are failing, 6 already replaced. One drive failed 2 days ago,
yesterday nobody was available to go there and swap a fresh one in, last
night the next drive failed, and now somebody is dashing there with 2
spares. ^o^

More often than not the additional strain of recovery will push disks
over the edge, quite aside from the increased likelihood of clustered
failures with certain drive models or at certain ages.

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
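P.S. For what it's worth, a quick Python sketch of the quoted
back-of-the-envelope numbers, assuming (as the quoted mail does) 600
disks, exactly 3 replicas per object, independent and uniformly random
failures, and that the first failed disk already held one of the copies:

  # Chance that the 2nd failure hits one of the 2 remaining copies,
  # then the 3rd failure hits the last one.
  n_disks = 600
  p_loss_given_three = (2.0 / (n_disks - 1)) * (1.0 / (n_disks - 2))
  print("P(lose all 3 copies | 3 failures) = %.3g" % p_loss_given_three)
  # -> ~5.58e-06, i.e. ~99.9994% survival even after a triple failure

  # Fold in the quoted 1-in-21.6 odds of a triple failure outpacing
  # recovery in the first place.
  p_loss = (1.0 / 21.6) * p_loss_given_three
  print("P(data loss overall) = %.3g" % p_loss)
  # -> ~2.6e-07, i.e. ~99.99997% reliability

Which is exactly where the weak spot is: the independent, uniformly
random failures that calculation assumes. The RAID6 mess above is what
failures tend to look like in practice.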