Best practice K/M-parameters EC pool

Hi Loic,

Thanks for the reply and interesting discussion.

On 26 August 2014 23:25, Loic Dachary <loic at dachary.org> wrote:
> Each time an OSD is lost, there is a 0.001*0.001 = 0.000001% chance that two other disks are lost before recovery. Since the disk that failed initially participates in 100 PG, that is 0.000001% x 100 = 0.0001% chance that a PG is lost.

Seems okay, so you're just taking the max PG spread as the worst case
(noting, as demonstrated with my numbers, that the spread could be
lower).

...actually, I could be way off here, but if the chance of any one
disk failing in that time is 0.0001%, then assuming the first failure
has already happened I'd have thought it would be more like:
(0.0001% / 2) * 99 * (0.0001% / 2) * 98
?
You're essentially calculating the probability of one more disk out
of the remaining 99 failing, and then another out of the remaining 98
(and so on), within the repair window (dividing by the number of
remaining replicas for which the probability is being calculated, as
otherwise you'd be counting their chance of failure in the recovery
window multiple times). And of course this all assumes the recovery
continues gracefully from the remaining replica(s) when another
failure occurs...?

Taking your followup correcting the base chances of failure into
account, that looks like:
99 * (1/100000 / 2) * 98 * (1/100000 / 2)
= 2.43e-7
or roughly 1 in 4.1 million
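
To make that easier to poke at, here's a minimal sketch of the
back-of-the-envelope calculation in Python; the per-disk failure
chance and the replica counts are just the assumed figures from the
discussion above, not measured values.

# Minimal sketch of the "two more failures during recovery" estimate.
# All inputs are assumptions from this thread, not measurements.
p_fail = 1.0 / 100000   # chance a given disk fails within the recovery window

# One of the remaining 99 peer disks fails, then one of the remaining 98,
# halving each term so the same window of risk isn't double-counted.
p_two_more = (99 * p_fail / 2) * (98 * p_fail / 2)

print("P(two more failures during recovery) ~ %.3g" % p_two_more)
print("i.e. roughly 1 in %.0f" % (1 / p_two_more))
# ~2.43e-7, or about 1 in 4.1 million, with these inputs.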

I'm also skeptical about the 1h recovery time - at the very least the
issues around stalling client ops come into play here and may push
max_backfills down for operational reasons (after all, you can't have
a general purpose volume storage service that periodically spikes
latency due to normal operational tasks like recoveries).
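
On that point, here's a rough recovery-time estimate showing why 1h
looks optimistic once backfills are throttled to protect client
latency; the data-per-OSD, per-backfill throughput and max_backfills
values are illustrative assumptions only.

# Back-of-the-envelope recovery time under throttled backfill.
# All numbers are illustrative assumptions, not cluster measurements.
data_per_osd_tb = 2.0    # TB to re-replicate when an OSD dies
backfill_mb_s = 50.0     # sustained MB/s each concurrent backfill can move
max_backfills = 2        # throttled low to avoid stalling client IO

data_mb = data_per_osd_tb * 1024 * 1024
hours = data_mb / (backfill_mb_s * max_backfills) / 3600
print("Estimated recovery time: %.1f hours" % hours)
# ~5.8 hours with these numbers, so the at-risk window (and hence the
# probability above) scales up accordingly.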

> Or the entire pool if it is used in a way that losing a PG means losing all data in the pool (as in your example, where it contains RBD volumes and each of the RBD volumes uses all the available PG).

Well, there's actually another whole interesting conversation in here
- assuming a decent filesystem is sitting on top of those RBDs it
should be possible to get those filesystems back into working order
and identify any lost inodes, and then, if you've got a tape backup,
recover the lost files from it. BUT, if you have just one pool for
these RBDs spread over the entire cluster, then the amount of work to
do that fsck-ing quickly becomes problematic - you'd have to fsck
every RBD! So I wonder if there is a case for partitioning large
clusters into multiple pools, so that such a failure would
(hopefully) have a more limited scope. Backups for DR purposes are
only worth having (and paying for) if the DR plan is actually
practical.
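
As a rough illustration of why a lost PG drags every image in the
pool into that fsck exercise, here's a small sketch; the PG count,
image size and 4MB object size are assumptions for the sake of the
example.

# Each RBD image is striped into many objects, and each object hashes
# to an effectively random PG, so a single lost PG touches almost
# every image in the pool. Illustrative assumptions only.
pg_num = 4096                                    # PGs in the pool
image_size_gb = 100                              # size of one RBD image
objects_per_image = image_size_gb * 1024 // 4    # 4MB objects per image

# Probability a given image has at least one object in the lost PG
p_image_hit = 1 - (1 - 1.0 / pg_num) ** objects_per_image
print("Objects per %dGB image: %d" % (image_size_gb, objects_per_image))
print("P(image touches the lost PG): %.4f" % p_image_hit)
# ~0.998 here, i.e. effectively every image in the pool needs checking.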

> If the pool is using at least two datacenters operated by two different organizations, this calculation makes sense to me. However, if the cluster is in a single datacenter, isn't it possible that some event independent of Ceph has a greater probability of permanently destroying the data ? A month ago I lost three machines in a Ceph cluster and realized on that occasion that the crushmap was not configured properly and that PG were lost as a result. Fortunately I was able to recover the disks and plug them in another machine to recover the lost PGs. I'm not a system administrator and the probability of me failing to do the right thing is higher than normal: this is just an example of a high probability event leading to data loss. In other words, I wonder if this 0.0001% chance of losing a PG within the hour following a disk failure matters or if it is dominated by other factors. What do you think ?

I wouldn't expect that number to be dominated by the chances of
total-loss/godzilla events, but I'm no datacentre reliability guru (at
least we don't have Godzilla here in Melbourne yet anyway). I couldn't
very quickly find any stats on "one-in-one-hundred year" events that
might actually destroy a datacentre. Availability is another question
altogether, which you probably know the Uptime Institute has specific
figures for tiers 1-4. But in my mind you should expect datacentre
power outages as an operational (rather than disaster) event, and
you'd want your Ceph cluster to survive them unscathed. If that
Copysets paper mentioned a while ago has any merit (see
http://hackingdistributed.com/2014/02/14/chainsets/ for more on that),
then it seems like the chances of drive loss following an availability
event are much higher than normal.
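
For what it's worth, here's a toy sketch of the copysets argument as
I read it; the OSD count, replication factor and PG count are made-up
numbers, and it assumes uniform random placement, distinct placement
sets, and exactly three simultaneous failures (e.g. after a power
event).

from math import comb

n_osds = 100
replication = 3
total_pgs = 100 * n_osds // replication    # ~100 PGs per OSD

possible_sets = comb(n_osds, replication)  # all 3-OSD combinations
used_sets = min(total_pgs, possible_sets)  # sets that actually hold a PG

# If 3 random disks die together, the chance they match one of the
# placement sets in use (i.e. some PG loses all of its copies):
p_loss = used_sets / float(possible_sets)
print("P(3 simultaneous failures lose a PG) ~ %.3f" % p_loss)
# ~0.021 with random placement; copyset-style placement drives the
# number of distinct sets (and so this probability) much lower.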

-- 
Cheers,
~Blairo

