Best practice K/M-parameters EC pool

Hi Blair,

Assuming that:

* The pool is configured for three replicas (size = 3 which is the default)
* It takes one hour for Ceph to recover from the loss of a single OSD
* Any other disk has a 0.001% chance to fail within the hour following the failure of the first disk (assuming the AFR, https://en.wikipedia.org/wiki/Annualized_failure_rate, of every disk is 10%, divided by the number of hours in a year).
* A given disk does not participate in more than 100 PGs

Each time an OSD is lost, there is a 0.001% * 0.001% = 0.00000001% chance that the two other disks holding copies of a given PG are lost before recovery completes. Since the disk that failed initially participates in 100 PGs, that is 0.00000001% x 100 = 0.000001% chance that a PG is lost. Or the entire pool, if it is used in a way where losing a PG means losing all data in the pool (as in your example, where it contains RBD volumes and each RBD volume uses all the available PGs).
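
To make the arithmetic explicit, here is a small awk sketch of the same estimate, using the assumptions listed above (10% AFR, one hour recovery window, 100 PGs per disk):

$ awk 'BEGIN {
    p_hour = 0.10 / (365 * 24)   # chance a given disk fails during the one hour of recovery
    p_two  = p_hour * p_hour     # both remaining replicas of one PG fail within that hour
    p_any  = p_two * 100         # the failed disk participates in 100 PGs
    printf "one disk in one hour    : %.6f%%\n", p_hour * 100
    printf "two more before recovery: %.8f%%\n", p_two * 100
    printf "any of the 100 PGs lost : %.6f%%\n", p_any * 100
}'

It prints roughly 0.001%, 0.00000001% and 0.000001% respectively.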

If the pool is spread over at least two datacenters operated by two different organizations, this calculation makes sense to me. However, if the cluster is in a single datacenter, isn't it possible that some event independent of Ceph has a greater probability of permanently destroying the data? A month ago I lost three machines in a Ceph cluster and realized on that occasion that the crushmap was not configured properly and that PGs were lost as a result. Fortunately I was able to recover the disks and plug them into another machine to recover the lost PGs.

I'm not a system administrator, and the probability of me failing to do the right thing is higher than normal: this is just an example of a high-probability event leading to data loss. In other words, I wonder whether this 0.000001% chance of losing a PG within the hour following a disk failure matters, or whether it is dominated by other factors. What do you think?

Cheers

On 26/08/2014 02:23, Blair Bethwaite wrote:
>> Message: 25
>> Date: Fri, 15 Aug 2014 15:06:49 +0200
>> From: Loic Dachary <loic at dachary.org>
>> To: Erik Logtenberg <erik at logtenberg.eu>, ceph-users at lists.ceph.com
>> Subject: Re: [ceph-users] Best practice K/M-parameters EC pool
>> Message-ID: <53EE05E9.1040105 at dachary.org>
>> Content-Type: text/plain; charset="iso-8859-1"
>> ...
>> Here is how I reason about it, roughly:
>>
>> If the probability of losing a disk is 0.1%, the probability of losing two disks simultaneously (i.e. before the failure can be recovered) would be 0.1% * 0.1% = 0.0001%, three disks becomes 0.0000001% and four disks becomes 0.0000000001%
> 
> I watched this conversation and an older similar one (Failure
> probability with largish deployments) with interest as we are in the
> process of planning a pretty large Ceph cluster (~3.5 PB), so I have
> been trying to wrap my head around these issues.
> 
> Loic's reasoning (above) seems sound as a naive approximation assuming
> independent probabilities for disk failures, which may not be quite
> true given potential for batch production issues, but should be okay
> for other sorts of correlations (assuming a sane crushmap that
> eliminates things like controllers and nodes as sources of
> correlation).
> 
> One of the things that came up in the "Failure probability with
> largish deployments" thread and has raised its head again here is the
> idea that striped data (e.g., RADOS-GW objects and RBD volumes) might
> be somehow more prone to data-loss than non-striped. I don't think
> anyone has so far provided an answer on this, so here's my thinking...
> 
> The level of atomicity that matters when looking at durability &
> availability in Ceph is the Placement Group. For any non-trivial RBD
> workload it is likely that most RBDs will span all or most PGs, e.g.,
> even a relatively small 50GiB volume is striped (with the default
> 4MiB object size) across 12800 objects - more objects than there are
> PGs in many production clusters obeying the 100-200 PGs per drive
> rule of thumb. <IMPORTANT>Losing any one PG will cause data-loss. The
> failure-probability effects of striping across multiple PGs are
> immaterial considering that loss of any single PG is likely to damage
> all your RBDs</IMPORTANT>. This might be why the reliability
> calculator doesn't consider the total number of disks.
> 
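As a rough check of that object count, and of how likely a single PG is to hold at least one object of such a volume, here is a sketch that assumes objects are spread uniformly over a pool with 4800 PGs (the PG count of the pool in the cluster described below):

$ awk 'BEGIN {
    objects = 50 * 1024 / 4        # 50 GiB volume striped into 4 MiB objects
    pgs     = 4800
    miss    = exp(-objects / pgs)  # Poisson approximation: PG holds none of them
    printf "objects in the volume: %d\n", objects
    printf "chance a given PG holds none of them: %.1f%%\n", miss * 100
}'

That gives 12800 objects and only a ~7% chance that any given PG escapes the volume entirely, so losing one PG almost certainly damages it.
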
> Related to all this is the durability of 2 versus 3 replicas (or e.g.
> M>=1 for Erasure Coding). It's easy to get caught up in the worrying
> fallacy that losing any M OSDs will cause data-loss, but this isn't
> true - they have to be members of the same PG for data-loss to occur.
> So then it's tempting to think the chances of that happening are so
> slim as to not matter, and to wonder why we would ever even need 3
> replicas. I mean, what are the odds of exactly those 2 drives, out of
> the 100, 200... in my cluster, failing within the recovery window?!
> But therein lies the rub - you should be thinking about PGs. If a
> drive fails, the chance of a resulting data-loss event depends on the
> chances of losing further drives from the affected/degraded PGs.
> 
> I've got a real cluster at hand, so let's use that as an example. We
> have 96 drives/OSDs - 8 nodes, 12 OSDs per node, 2 replicas, top-down
> failure domains: rack pairs (x2), nodes, OSDs... Let's say OSD 15
> dies. How many PGs are now at risk:
> $ grep "^10\." pg.dump | awk '{print $15}' | grep 15 | wc
>     109     109     861
> (NB: 10 is the pool id, pg.dump is a text file dump of "ceph pg dump",
> $15 is the acting set column)
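
The substring grep happens to be safe here because the OSD ids only run 0-95; an exact match on the acting set is a little more robust. A sketch against the same pg.dump and column:

$ awk -v osd=15 '/^10\./ {
    s = $15; gsub(/\[|\]/, "", s)   # strip the brackets around the acting set
    n = split(s, a, ",")
    for (i = 1; i <= n; i++) if (a[i] == osd) { print s; break }
}' pg.dump | wc -l

It counts the acting sets that contain OSD 15 exactly (109 here).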
> 
> 109 PGs now "living on the edge". No surprises in that number as we
> used 100 * 96 / 2 = 4800 to arrive at the PG count for this pool, so
> on average any one OSD will be primary for 50 PGs and replica for
> another 50. But this doesn't tell me how exposed I am; for that I need
> to know how many "neighbouring" OSDs there are in these 109 PGs:
> $ grep "^10\." pg.dump | awk '{print $15}' | grep 15 | sed
> 's/\[15,\(.*\)\]/\1/' | sed 's/\[\(.*\),15\]/\1/' | sort | uniq | wc
>      67      67     193
> (NB: grep-ing for OSD "15" and using sed to remove it and surrounding
> formatting to get just the neighbour id)
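
The same caveat applies, and the sed only copes with two-entry acting sets; counting the distinct peers with an exact match (and any number of replicas) could look like this:

$ awk -v osd=15 '/^10\./ {
    s = $15; gsub(/\[|\]/, "", s)
    n = split(s, a, ",")
    hit = 0
    for (i = 1; i <= n; i++) if (a[i] == osd) hit = 1
    if (hit) {
        for (i = 1; i <= n; i++)
            if (a[i] != osd && !(a[i] in peers)) { peers[a[i]] = 1; count++ }
    }
} END { print count + 0 }' pg.dump

Which should print the 67 neighbours found above.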
> 
> Yikes! So if any one of those 67 drives fails during recovery of OSD
> 15, then we've lost data. On average we should expect this to be
> determined by our crushmap, which in this case splits the cluster up
> into 2 top-level failure domains, so I'd have guessed each OSD would
> be exposed to about 48 other drives on average. But actually, looking
> at the numbers for each OSD, it is higher than that here - the lowest
> distinct "neighbour" count we have is 50. Note that we haven't tuned
> any of the options in our crushmap, so I guess maybe Ceph favours
> fewer repeat sets by default when coming up with PGs(?).
> 
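Plugging those neighbour counts back into the kind of estimate I used at the top of this mail (10% AFR and a one hour recovery window; both are assumptions, and rebalancing a full disk may well take longer than an hour):

$ awk 'BEGIN {
    p_hour = 0.10 / (365 * 24)        # chance a given disk fails within that hour
    for (n = 50; n <= 70; n += 10) {
        p = (1 - (1 - p_hour)^n) * 100
        printf "%d neighbours: %.4f%% chance of a second failure during recovery\n", n, p
    }
}'

So with two replicas this cluster would be looking at something like a 0.06-0.08% chance of losing at least one PG every time an OSD dies, under those assumptions.
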
> Anyway, here's the average and top 10 neighbour counts (hope this
> scripting is right! ;-):
> 
> $ for OSD in {0..95}; do echo -ne "$OSD\t"; grep "^10\." pg.dump | awk
> '{print $15}' | grep "\[${OSD},\|,${OSD}\]" | sed
> "s/\[$OSD,\(.*\)\]/\1/" | sed "s/\[\(.*\),$OSD\]/\1/" | sort | uniq |
> wc -l; done | awk '{ total += $2 } END { print total/NR }'
> 58.5208
> 
> $ for OSD in {0..95}; do echo -ne "$OSD\t"; grep "^10\." pg.dump | awk
> '{print $15}' | grep "\[${OSD},\|,${OSD}\]" | sed
> "s/\[$OSD,\(.*\)\]/\1/" | sed "s/\[\(.*\),$OSD\]/\1/" | sort | uniq |
> wc -l; done | sort -k2 -r | head
> 78 69
> 37 68
> 92 67
> 15 67
> 91 66
> 66 65
> 61 65
> 89 64
> 88 64
> 87 64
> (OSD# Neighbour#)
> 
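The whole 96-iteration loop can also be done in one awk pass over the dump, exact-matching the ids; a sketch that should reproduce the same counts:

$ awk '/^10\./ {
    s = $15; gsub(/\[|\]/, "", s)
    n = split(s, a, ",")
    for (i = 1; i <= n; i++) {
        for (j = 1; j <= n; j++) {
            key = a[i] "," a[j]
            if (i != j && !(key in seen)) { seen[key] = 1; cnt[a[i]]++ }
        }
    }
} END { for (o in cnt) print o, cnt[o] }' pg.dump | sort -k2 -nr | head
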
> So, if I am getting this right, then at the end of the day __I think__
> all this essentially boils down (sans CRUSH) to the number of possible
> combinations (not permutations - order is irrelevant) of OSDs that can
> be chosen. Shrinking the fraction of those combinations that a pool
> actually uses is only possible by increasing r in nCr:
> 96 choose 2 = 4560
> 96 choose 3 = 142880
> 
> So basically with two replicas, if _any_ two disks fail within your
> recovery window the chance of data-loss is high, because there is a
> good chance those two OSDs share at least one of the PGs actually
> present in the pool. With three replicas that tapers off hugely, as
> we're only utilising 4800 / 142880 * 100 ~= 3.4% of the possible
> 3-OSD combinations.
> 
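For reference, the same combination counts and the fraction this pool actually uses, as a one-liner (4800 being the PG count from above):

$ awk 'BEGIN {
    n = 96
    c2 = n * (n - 1) / 2              # 96 choose 2
    c3 = n * (n - 1) * (n - 2) / 6    # 96 choose 3
    printf "C(96,2) = %d, C(96,3) = %d\n", c2, c3
    printf "fraction of 3-OSD sets used by 4800 PGs: %.1f%%\n", 4800 / c3 * 100
}'
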
> I guess to some extent that this holds true for M values in EC pools.
> 
> I hope some of this makes sense...? I'd love to see some of these
> questions answered canonically by Inktank or Sage, if not then perhaps
> I'll see how far I get sticking this diatribe into the ICE support
> portal...
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
