Re: Erasure Coding failure domain (again)

Hello Hector,

Firstly I'm so happy somebody actually replied.

On Tue, 2 Apr 2019 16:43:10 +0900 Hector Martin wrote:

> On 31/03/2019 17.56, Christian Balzer wrote:
> > Am I correct that unlike with replication there isn't a maximum size
> > of the critical path OSDs?  
> 
> As far as I know, the math for calculating the probability of data loss 
> wrt placement groups is the same for EC and for replication. Replication 
> to n copies should be equivalent to EC with k=1 and m=(n-1).
> 
> > Meaning that with replication x3 and typical values of 100 PGs per OSD at
> > most 300 OSDs form a set out of which 3 OSDs need to fail for data loss.
> > The statistical likelihood for that based on some assumptions
> > is significant, but not nightmarishly so.
> > A cluster with 1500 OSDs in total is thus as susceptible as one with just
> > 300.
> > Meaning that 3 disk losses in the big cluster don't necessarily mean data
> > loss at all.  
> 
> Someone might correct me on this, but here's my take on the math.
> 
> If you have 100 PGs per OSD, 1500 OSDs, and replication 3, you have:
> 
> 1500 * 100 / 3 = 50000 pool PGs, and thus 50000 (hopefully) different 
> 3-sets of OSDs.
>
I think your math is essentially correct, but the "hopefully" caveat seems
warranted as well.

I took a quick look at my test cluster (20 OSDs, 5 hosts) and a replica 2
pool with 1024 PGs, which by your formula should give us about 1000
(hopefully distinct) 2-sets of OSDs to choose from.
Just looking at OSD 0 and the first 6 other OSDs out of that list of 1024
PGs gives us this:
---
UP              UP_PRIMARY
[0,1]              0  
[0,2]              0   
[0,2]              0   
[0,2]              0   
[0,3]              0    
[0,3]              0  
[0,3]              0   
[0,3]              0   
[0,3]              0   
[0,5]              0   
[0,5]              0   
[0,5]              0   
[0,5]              0   
[0,5]              0   
[0,6]              0   
[0,6]              0   
[1,0]              1   
[1,0]              1   
[1,0]              1   
[2,0]              2   
[2,0]              2   
[2,0]              2   
[3,0]              3   
[3,0]              3   
[5,0]              5   
[5,0]              5   
[5,0]              5   
[6,0]              6   
[6,0]              6  
[6,0]              6  
---

So this looks significantly worse than the theoretical set of choices; with
only 20 OSDs there are at most (20 choose 2) = 190 distinct pairs to begin
with, so the roughly 1000 sets above cannot all be different anyway.
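
For anyone who wants to check this properly rather than eyeballing the first
few lines like I did, here's a rough Python sketch that counts the distinct
OSD sets per pool. It assumes "ceph pg dump pgs_brief -f json" output with
per-PG "pgid" and "up" fields; the exact field names and nesting vary a bit
between Ceph releases, so treat it as a starting point:
---
#!/usr/bin/env python3
# Count distinct OSD sets per pool.
# Usage: ceph pg dump pgs_brief -f json | python3 count_sets.py
# Field names/nesting differ slightly between Ceph releases; adjust as needed.
import json
import sys
from collections import defaultdict

data = json.load(sys.stdin)
# Newer releases wrap the per-PG list (e.g. under "pg_stats"); older ones
# return the list directly.
pgs = data.get("pg_stats", data) if isinstance(data, dict) else data

distinct = defaultdict(set)
total = defaultdict(int)
for pg in pgs:
    pool = pg["pgid"].split(".")[0]
    distinct[pool].add(tuple(sorted(pg["up"])))
    total[pool] += 1

for pool in sorted(distinct):
    print("pool %s: %d PGs mapped onto %d distinct OSD sets"
          % (pool, total[pool], len(distinct[pool])))
---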
 
Another thing to look at here is of course the critical period (the window
until redundancy is restored) and disk failure probabilities. These guys
explain the logic behind their calculator; I would be delighted if you could
have a peek and comment:

https://www.memset.com/support/resources/raid-calculator/
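
For context, the usual simplification behind calculators like that (and a big
assumption, since it treats disk failures as independent and exponentially
distributed) is to ask how likely a further failure is during the rebuild
window. A quick sketch of that model, with made-up AFR and rebuild-time
numbers:
---
#!/usr/bin/env python3
# Rough "critical period" model: chance that any of the remaining disks in a
# redundancy set also fails before redundancy is restored.  Assumes
# independent failures at a constant rate (exponential model), the usual
# simplification in RAID-style reliability calculators.
import math

afr = 0.03             # assumed annual failure rate per disk (3%), made up
rebuild_hours = 24.0   # assumed time to restore full redundancy, made up
remaining = 1          # disks we cannot afford to lose (1 for a replica 2 PG)

rate = -math.log(1.0 - afr) / (365 * 24)        # per-disk hourly failure rate
p_one = 1.0 - math.exp(-rate * rebuild_hours)   # one disk dying in the window
p_any = 1.0 - (1.0 - p_one) ** remaining        # any of the remaining disks
print("P(further failure during a %.0fh rebuild) = %.4f%%"
      % (rebuild_hours, p_any * 100))
---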

Thanks again for the feedback!

Christian

> (1500 choose 3) = 561375500 possible sets of 3 OSDs
> 
> Therefore if you lose 3 random OSDs, your chance of (any) data loss is 
> 50000/561375500 = ~0.008%. (and if you *do* get unlucky and hit the 
> wrong set of 3 OSDs, you can expect to lose 1/50000 = ~0.002% of your data)
> 
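Here's a trivial Python sketch of that calculation (needs Python 3.8+ for
math.comb), in case anyone wants to play with the numbers:
---
#!/usr/bin/env python3
# Re-do the replication arithmetic above (needs Python 3.8+ for math.comb).
from math import comb

osds, pgs_per_osd, size = 1500, 100, 3
pool_pgs = osds * pgs_per_osd // size        # 50000 3-sets actually in use
possible = comb(osds, size)                  # 561375500 possible 3-sets
p_loss = pool_pgs / possible                 # chance 3 dead OSDs hit a used set
print("pool PGs: %d, possible 3-sets: %d" % (pool_pgs, possible))
print("P(any data loss | 3 OSDs lost) = %.4f%%" % (p_loss * 100))
print("data lost in that case        ~= %.3f%%" % (100.0 / pool_pgs))
---
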
> > However it feels that with EC all OSDs can essentially be in the same set
> > and thus having 6 out of 1500 OSDs fail in a 10+5 EC pool with 100 PGs per
> > OSD would affect every last object in that cluster, not just a subset.  
> 
> The math should work essentially the same way:
> 
> 1500 * 100 / 15 = 10000 15-sets of OSDs
> 
> (1500 choose 15) = 3.1215495e+35 possible 15-sets of OSDs
> 
> Now if 6 OSDs fail, that will affect every potential 15-set whose
> remaining 9 members are drawn from the surviving OSDs in the cluster:
> 
> ((1500 - 6) choose 9) = 9.9748762e+22
> 
> Putting it together, the chance of any data loss from a simultaneous 
> loss of 6 random OSDs:
> 
> 10000 / 3.1215495e+35 * 9.9748762e+22 = 0.00000032%
> 
> And if you *do* get unlucky you can expect to lose 1/10000 = ~0.01% of 
> your data.
> 
> So your chance of data loss is much smaller with such a wide EC 
> encoding, but if you do lose a PG you'll lose more data because there 
> are fewer PGs.
> 
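And the EC variant of the same sketch, again just reproducing your formula
(it treats PG placements as independent random 15-sets, which as my example
above shows is optimistic):
---
#!/usr/bin/env python3
# Same check for the 10+5 EC pool (needs Python 3.8+ for math.comb).
from math import comb

osds, pgs_per_osd, k, m = 1500, 100, 10, 5
width = k + m                                # 15 shards per PG
pool_pgs = osds * pgs_per_osd // width       # 10000 15-sets actually in use
failed = m + 1                               # 6 failures needed to lose data
# 15-sets containing all 6 failed OSDs, as a fraction of all possible
# 15-sets, summed over the pool's PGs (a union bound, fine at this scale):
p_loss = pool_pgs * comb(osds - failed, width - failed) / comb(osds, width)
print("P(any data loss | %d OSDs lost) = %.8f%%" % (failed, p_loss * 100))
print("data lost in that case         ~= %.2f%%" % (100.0 / pool_pgs))
---
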
> Feedback on my math welcome.
> -- 
> Hector Martin (hector@xxxxxxxxxxxxxx)
> Public Key: https://mrcn.st/pub
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


