>> I haven't done the actual calculations, but given some % chance of disk
>> failure, I would assume that losing x out of y disks has roughly the
>> same chance as losing 2*x out of 2*y disks over the same period.
>>
>> That's also why you generally want to limit RAID5 arrays to maybe 6
>> disks or so and move to RAID6 for bigger arrays. For arrays bigger than
>> 20 disks you would usually split those into separate arrays, just to
>> keep the (parity disks / total disks) fraction high enough.
>>
>> With regard to data safety I would guess that 3+2 and 6+4 are roughly
>> equal, although the behaviour of 6+4 is probably easier to predict
>> because bigger numbers make your calculations less dependent on
>> individual deviations in reliability.
>>
>> Do you guys feel this argument is valid?
>
> Here is how I reason about it, roughly:
>
> If the probability of losing a disk is 0.1%, the probability of losing two disks simultaneously (i.e. before the failure can be recovered) would be 0.1% * 0.1% = 0.0001%, three disks 0.1%^3 = 0.0000001%, and four disks 0.1%^4 = 0.0000000001%.
>
> Accurately calculating the reliability of the system as a whole is a lot more complex (see https://wiki.ceph.com/Development/Add_erasure_coding_to_the_durability_model/ for more information).
>
> Cheers

Okay, I see that in your calculation you leave the total number of disks completely out of the equation.

The link you provided is very useful indeed and does some actual calculations. Interestingly, the example on the technical details page [1] uses k=32 and m=32, for a total of 64 blocks. Those values are much bigger than the ones Mark Nelson mentioned earlier.

Is that example merely meant to demonstrate the theoretical advantages, or would you actually recommend using those numbers in practice? Assuming we have at least 64 OSDs available, would you recommend k=32 and m=32?

[1] https://wiki.ceph.com/Development/Add_erasure_coding_to_the_durability_model/Technical_details_on_the_model
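
As a sanity check on the back-of-the-envelope numbers above, here is a minimal sketch (my own, not the model from the wiki page): it treats each of the k+m OSDs as failing independently with some probability p during a single recovery window, and sums the binomial tail for "more than m failures", which is when data is lost. The value p = 0.001, the function name and the comparison of 3+2, 6+4 and 32+32 are only illustrative assumptions.

    from math import comb  # needs Python 3.8+

    def p_data_loss(k, m, p):
        """Probability that more than m of the k+m disks fail within one
        recovery window, assuming independent failures with probability p."""
        n = k + m
        return sum(comb(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(m + 1, n + 1))

    # p = 0.001 is just an illustrative per-window failure probability.
    for k, m in [(3, 2), (6, 4), (32, 32)]:
        print("k=%2d m=%2d  P(loss) ~ %.3e" % (k, m, p_data_loss(k, m, 0.001)))

Under these (unrealistically simple) assumptions 32+32 comes out far more durable than 3+2 or 6+4, simply because it can survive 32 concurrent failures. It ignores recovery time, the network/CPU cost of such a wide stripe, and correlated failures, so it is only a first approximation; the wiki model is the place to look for a proper treatment.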