Best practice K/M-parameters EC pool

Hi Blair,

On 28/08/2014 16:38, Blair Bethwaite wrote:
> Hi Loic,
> 
> Thanks for the reply and interesting discussion.

I'm learning a lot :-)

> On 26 August 2014 23:25, Loic Dachary <loic at dachary.org> wrote:
>> Each time an OSD is lost, there is a 0.001*0.001 = 0.000001% chance that two other disks are lost before recovery. Since the disk that failed initially participates in 100 PG, that is 0.000001% x 100 = 0.0001% chance that a PG is lost.
> 
> Seems okay, so you're just taking the max PG spread as the worst case
> (noting as demonstrated with my numbers that the spread could be
> lower).
> 
> ...actually, I could be way off here, but if the chance of any one
> disk failing in that time is 0.0001%, then assuming the first failure
> has already happened I'd have thought it would be more like:
> (0.0001% / 2) * 99 * (0.0001% / 2) * 98
> ?
> As you're essentially calculating the probability of one more disk out
> of the remaining 99 failing, and then another out of the remaining 98
> (and so on), within the repair window (dividing by the number of
> remaining replicas for which the probability is being calculated, as
> otherwise you'd be counting their chance of failure in the recovery
> window multiple times). And of course this all assumes the recovery
> continues gracefully from the remaining replica/s when another failure
> occurs...?

That makes sense. I chose to arbitrarily ignore the probability of the first failure because that event is not bounded in time. The second failure only matters if it happens within the interval it takes the cluster to re-create the missing copies, and that seemed like the more important part.

> Taking your followup correcting the base chances of failure into
> account, then that looks like:
> 99(1/100000 / 2) * 98(1/100000 / 2)
> = 9.702e-7
> 1 in 1030715

If a disk participates in 100 PGs with replica 3, it means there is a maximum of 200 other disks involved (assuming the cluster is large enough that the odds of two disks being used together in more than one PG are very low). You are assuming that this total is 100, which seems a reasonable approximation; I guess it could be verified by tests on a crushmap. However, it also means that the second failing disk probably shares 2 PGs with the first failing disk, in which case the 98 should rather be 2 (i.e. the number of PGs that are down to one replica as a result of the double failure).
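
To make sure we are talking about the same numbers, here is the back-of-the-envelope calculation as a small Python sketch. The inputs are only the assumptions from this thread (1/100000 chance of a disk failing within the recovery window, ~100 peer disks, ~2 PGs shared between the first two failed disks), not measurements:

    # Assumptions taken from this thread, not measurements:
    p = 1.0 / 100000   # chance a given disk fails during the recovery window
    peers = 100        # other disks holding copies of the failed disk's PGs
    shared = 2         # PGs left with a single replica after the second failure

    # Any two of the remaining peers fail within the window (your estimate).
    two_more = (peers - 1) * p * (peers - 2) * p
    print("two more peers fail:       %.3e (1 in %.0f)" % (two_more, 1 / two_more))

    # Shared-PG correction: the third failure only matters if it hits one of
    # the ~2 disks holding the last replica of the PGs that the first two
    # failures have in common.
    corrected = (peers - 1) * p * shared * p
    print("with shared-PG correction: %.3e (1 in %.0f)" % (corrected, 1 / corrected))

The first number matches the 9.702e-7 / 1 in 1030715 above; the second is what I get if the 98 becomes 2.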

> I'm also skeptical on the 1h recovery time - at the very least the
> issues regarding stalling client ops come into play here and may push
> the max_backfills down for operational reasons (after all, you can't
> have a general purpose volume storage service that periodically spikes
> latency due to normal operational tasks like recoveries).

If the cluster is overloaded (disk I/O, cluster network), re-creating the lost copies in less than 2h does indeed seem unlikely.
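
For what it is worth, pushing max_backfills down as you describe would look something like this in ceph.conf (the values are only an illustration; lowering them protects client latency but lengthens the recovery window, i.e. the exposure window in the calculation above):

    [osd]
      osd max backfills = 1
      osd recovery max active = 1
      osd recovery op priority = 1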
 
>> Or the entire pool if it is used in a way that losing a PG means losing all data in the pool (as in your example, where it contains RBD volumes and each of the RBD volumes uses all the available PG).
> 
> Well, there's actually another whole interesting conversation in here
> - assuming a decent filesystem is sitting on top of those RBDs it
> should be possible to get those filesystems back into working order
> and identify any lost inodes, and then, if you've got one you can
> recover from tape backup. BUT, if you have just one pool for these
> RBDs spread over the entire cluster then the amount of work to do that
> fsck-ing is quickly going to be problematic - you'd have to fsck every
> RBD! So I wonder if there is cause for partitioning large clusters
> into multiple pools, so that such a failure would (hopefully) have a
> more limited scope. Backups for DR purposes are only worth having (and
> paying for) if the DR plan is actually practical.
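
Partitioning sounds appealing to me too. Just as a sketch (the pool names, PG counts and image size below are made up for the example), it would be no more than:

    ceph osd pool create volumes-group-a 2048
    ceph osd pool create volumes-group-b 2048
    rbd create --pool volumes-group-a --size 102400 some-volume

so that losing a PG only means fsck-ing the RBD images of the pool it belongs to, not every volume in the cluster.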
> 
>> If the pool is using at least two datacenters operated by two different organizations, this calculation makes sense to me. However, if the cluster is in a single datacenter, isn't it possible that some event independent of Ceph has a greater probability of permanently destroying the data? A month ago I lost three machines in a Ceph cluster and realized on that occasion that the crushmap was not configured properly and that PGs were lost as a result. Fortunately I was able to recover the disks and plug them into another machine to recover the lost PGs. I'm not a system administrator and the probability of me failing to do the right thing is higher than normal: this is just an example of a high-probability event leading to data loss. In other words, I wonder if this 0.0001% chance of losing a PG within the hour following a disk failure matters or if it is dominated by other factors. What do you think?
> 
> I wouldn't expect that number to be dominated by the chances of
> total-loss/godzilla events, but I'm no datacentre reliability guru (at
> least we don't have Godzilla here in Melbourne yet anyway). I couldn't
> very quickly find any stats on "one-in-one-hundred year" events that
> might actually destroy a datacentre. Availability is another question
> altogether, which you probably know the Uptime Institute has specific
> figures for tiers 1-4. But in my mind you should expect datacentre
> power outages as an operational (rather than disaster) event, and
> you'd want your Ceph cluster to survive them unscathed. If that
> Copysets paper mentioned a while ago has any merit (see
> http://hackingdistributed.com/2014/02/14/chainsets/ for more on that),
> then it seems like the chances of drive loss following an availability
> event are much higher than normal.

:-)

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre
