Hello Hector,

Firstly, I'm so happy somebody actually replied.

On Tue, 2 Apr 2019 16:43:10 +0900 Hector Martin wrote:

> On 31/03/2019 17.56, Christian Balzer wrote:
> > Am I correct that unlike with replication there isn't a maximum size
> > of the critical path OSDs?
>
> As far as I know, the math for calculating the probability of data loss
> wrt placement groups is the same for EC and for replication. Replication
> to n copies should be equivalent to EC with k=1 and m=(n-1).
>
> > Meaning that with replication x3 and typical values of 100 PGs per OSD at
> > most 300 OSDs form a set out of which 3 OSDs need to fail for data loss.
> > The statistical likelihood for that based on some assumptions
> > is significant, but not nightmarishly so.
> > A cluster with 1500 OSDs in total is thus as susceptible as one with just
> > 300.
> > Meaning that 3 disk losses in the big cluster don't necessarily mean data
> > loss at all.
>
> Someone might correct me on this, but here's my take on the math.
>
> If you have 100 PGs per OSD, 1500 OSDs, and replication 3, you have:
>
> 1500 * 100 / 3 = 50000 pool PGs, and thus 50000 (hopefully) different
> 3-sets of OSDs.
>

I think your math is essentially correct, but the "hopefully different"
part seems to be the catch.

I took a quick peek at my test cluster (20 OSDs, 5 hosts) and a replica 2
pool with 1024 PGs, which by your formula should give us about 1000
different 2-sets of OSDs to choose from.

Just looking at OSD 0 and the first 6 other OSDs out of that list of 1024
PGs gives us this:
---
UP_PRIMARY ACTING
[0,1] 0
[0,2] 0
[0,2] 0
[0,2] 0
[0,3] 0
[0,3] 0
[0,3] 0
[0,3] 0
[0,3] 0
[0,5] 0
[0,5] 0
[0,5] 0
[0,5] 0
[0,5] 0
[0,6] 0
[0,6] 0
[1,0] 1
[1,0] 1
[1,0] 1
[2,0] 2
[2,0] 2
[2,0] 2
[3,0] 3
[3,0] 3
[5,0] 5
[5,0] 5
[5,0] 5
[6,0] 6
[6,0] 6
[6,0] 6
---
So this looks significantly worse than the theoretical set of choices.

Another thing to look at here is of course the critical period and disk
failure probabilities. The Memset folks explain the logic behind their
calculator here; I'd be delighted if you could have a peek and comment:
https://www.memset.com/support/resources/raid-calculator/

Thanks again for the feedback!

Christian

> (1500 choose 3) = 561375500 possible sets of 3 OSDs
>
> Therefore if you lose 3 random OSDs, your chance of (any) data loss is
> 50000/561375500 = ~0.008%. (and if you *do* get unlucky and hit the
> wrong set of 3 OSDs, you can expect to lose 1/50000 = ~0.002% of your data)
>
> > However it feels that with EC all OSDs can essentially be in the same set
> > and thus having 6 out of 1500 OSDs fail in a 10+5 EC pool with 100 PGs per
> > OSD would affect every last object in that cluster, not just a subset.
>
> The math should work essentially the same way:
>
> 1500 * 100 / 15 = 10000 15-sets of OSDs
>
> (1500 choose 15) = 3.1215495e+35 possible 15-sets of OSDs
>
> Now if 6 OSDs fail that will affect many potential 15-sets of OSDs
> chosen with the remaining OSDs in the cluster:
>
> ((1500 - 6) choose 9) = 9.9748762e+22
>
> Putting it together, the chance of any data loss from a simultaneous
> loss of 6 random OSDs:
>
> 10000 / 3.1215495e+35 * 9.9748762e+22 = 0.00000032%
>
> And if you *do* get unlucky you can expect to lose 1/10000 = ~0.01% of
> your data.
>
> So your chance of data loss is much smaller with such a wide EC
> encoding, but if you do lose a PG you'll lose more data because there
> are fewer PGs.
>
> Feedback on my math welcome.
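As a quick sanity check on the numbers above, here is a minimal Python
sketch that redoes the same counting argument. This is just back-of-the-
envelope code of my own (nothing from Ceph itself), and like your math it
assumes OSD failures are uniformly random and that every PG really does
map to a distinct set of OSDs:
---
#!/usr/bin/env python3
# Back-of-the-envelope check of the figures above, assuming uniformly
# random OSD failures and distinct OSD sets per PG (the "hopefully"
# caveat discussed earlier).
from math import comb  # needs Python 3.8+

def expected_lost_pgs(osds, pgs_per_osd, set_size, failed):
    """Expected number of PGs whose OSD set contains *all* of the failed
    OSDs. In the two scenarios above (3 failures with replica 3, 6
    failures with EC 10+5) that is exactly the condition for data loss.
    For small values this is also roughly the chance of losing anything."""
    pg_sets = osds * pgs_per_osd // set_size  # OSD sets actually in use
    # Fraction of all possible 'set_size'-sets that contain every failed OSD:
    hit_fraction = comb(osds - failed, set_size - failed) / comb(osds, set_size)
    return pg_sets * hit_fraction

# Replication 3, 1500 OSDs, 100 PGs/OSD, 3 simultaneous failures:
print(expected_lost_pgs(1500, 100, 3, 3))    # ~8.9e-05, i.e. the ~0.008% above
# EC 10+5 (15-OSD sets), same cluster, 6 simultaneous failures:
print(expected_lost_pgs(1500, 100, 15, 6))   # ~3.2e-09, i.e. the ~0.00000032% above
---
Both results agree with your ~0.008% and ~0.00000032% figures, up to
rounding.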
> --
> Hector Martin (hector@xxxxxxxxxxxxxx)
> Public Key: https://mrcn.st/pub

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com