Best practice K/M-parameters EC pool

Hello,

On Tue, 26 Aug 2014 16:12:11 +0200 Loic Dachary wrote:

> Using percentages instead of numbers led me to calculation errors.
> Here it is again using 1/100 instead of % for clarity ;-)
> 
> Assuming that:
> 
> * The pool is configured for three replicas (size = 3 which is the
> default)
> * It takes one hour for Ceph to recover from the loss of a single OSD
I think Craig and I have debunked that number.
It will be something like "that depends on many things starting with the
amount of data, the disk speeds, the contention (client and other ops),
the network speed/utilization, the actual OSD process and queue handling
speed, etc.".
If you want to make an assumption that's not an order of magnitude wrong,
start with 24 hours.

It would be nice to hear from people with really huge clusters, like Dan at
CERN, what their recovery speeds look like.

> * Any other disk has a 1/100,000 chance to fail within the hour
> following the failure of the first disk (assuming AFR
> https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is
> 8%, divided by the number of hours in a year: 0.08 / 8760 ~= 1/100,000)
> * A given disk does not participate in more than 100 PG
> 
You will find that the smaller the cluster, the more likely that number is
to be higher than 100, due to rounding up or simply bumping it up because
the distribution would otherwise be too uneven.


> Each time an OSD is lost, there is a 1/100,000*1/100,000 =
> 1/10,000,000,000 chance that two other disks are lost before recovery.
> Since the disk that failed initially participates in 100 PG, that is
> 1/10,000,000,000 x 100 = 1/100,000,000 chance that a PG is lost. Or the
> entire pool if it is used in a way that losing a PG means losing all
> data in the pool (as in your example, where it contains RBD volumes and
> each of the RBD volumes uses all the available PGs).
> 
> If the pool is using at least two datacenters operated by two different
> organizations, this calculation makes sense to me. However, if the
> cluster is in a single datacenter, isn't it possible that some event
> independent of Ceph has a greater probability of permanently destroying
> the data ? A month ago I lost three machines in a Ceph cluster and
> realized on that occasion that the crushmap was not configured properly
> and that PG were lost as a result. Fortunately I was able to recover the
> disks and plug them in another machine to recover the lost PGs. I'm not
> a system administrator and the probability of me failing to do the right
> thing is higher than normal: this is just an example of a high
> probability event leading to data loss. Another example would be if all
> disks in the same PG are part of the same batch and therefore likely to
> fail at the same time. In other words, I wonder if this 1/100,000,000
> chance of losing a PG within the hour following a disk failure matters or
> if it is dominated by other factors. What do you think?
>

Batch failures are real; I'm seeing them all the time.
However, they still tend to be spaced out widely enough most of the time.
Still something to consider in a complete calculation.
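
Plugging those more pessimistic assumptions into Loic's model above shows
how sensitive the result is to the recovery window and the PG count. A
rough sketch (naive independent-failure model, all inputs are assumptions,
not measurements):

afr = 0.08                    # assumed annualized failure rate per disk
p_hour = afr / (24 * 365)     # ~1/100,000 chance a given disk fails in any hour

def p_pg_loss(recovery_hours, pgs_per_osd):
    # chance that two further disks sharing a PG with the failed one
    # are lost before recovery completes (Loic's model, 3 replicas)
    p_window = p_hour * recovery_hours
    return (p_window ** 2) * pgs_per_osd

print(p_pg_loss(1, 100))      # Loic's numbers:          ~8e-9
print(p_pg_loss(24, 100))     # 24h recovery window:     ~5e-6
print(p_pg_loss(24, 200))     # 24h window, 200 PGs/OSD: ~1e-5

A 24 hour window and a denser PG distribution move the result by roughly
three orders of magnitude, which is why the input assumptions matter more
than the arithmetic.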

As for failures other than disks, these tend to be recoverable, as you
experienced yourself. A node, rack, whatever failure might make your
cluster temporarily inaccessible (and thus should be avoided by proper
CRUSH maps and other precautions), but it will not lead to actual data
loss.
  
Regards,

Christian
 
> Cheers
> 
> On 26/08/2014 15:25, Loic Dachary wrote:
> > Hi Blair,
> > 
> > Assuming that:
> > 
> > * The pool is configured for three replicas (size = 3 which is the
> > default)
> > * It takes one hour for Ceph to recover from the loss of a single OSD
> > * Any other disk has a 0.001% chance to fail within the hour following
> > the failure of the first disk (assuming AFR
> > https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is
> > 10%, divided by the number of hours during a year).
> > * A given disk does not participate in more than 100 PG
> > 
> > Each time an OSD is lost, there is a 0.001*0.001 = 0.000001% chance
> > that two other disks are lost before recovery. Since the disk that
> > failed initially participates in 100 PG, that is 0.000001% x 100 =
> > 0.0001% chance that a PG is lost. Or the entire pool if it is used in
> > a way that losing a PG means losing all data in the pool (as in your
> > example, where it contains RBD volumes and each of the RBD volumes uses
> > all the available PGs).
> > 
> > If the pool is using at least two datacenters operated by two
> > different organizations, this calculation makes sense to me. However,
> > if the cluster is in a single datacenter, isn't it possible that some
> > event independent of Ceph has a greater probability of permanently
> > destroying the data ? A month ago I lost three machines in a Ceph
> > cluster and realized on that occasion that the crushmap was not
> > configured properly and that PG were lost as a result. Fortunately I
> > was able to recover the disks and plug them in another machine to
> > recover the lost PGs. I'm not a system administrator and the
> > probability of me failing to do the right thing is higher than normal:
> > this is just an example of a high probability event leading to data
> > loss. In other words, I wonder if this 0.0001% chance of losing a PG
> > within the hour following a disk failure matters or if it is dominated
> > by other factors. What do you think ?
> > 
> > Cheers
> > 
> > On 26/08/2014 02:23, Blair Bethwaite wrote:
> >>> Message: 25
> >>> Date: Fri, 15 Aug 2014 15:06:49 +0200
> >>> From: Loic Dachary <loic at dachary.org>
> >>> To: Erik Logtenberg <erik at logtenberg.eu>, ceph-users at lists.ceph.com
> >>> Subject: Re: [ceph-users] Best practice K/M-parameters EC pool
> >>> Message-ID: <53EE05E9.1040105 at dachary.org>
> >>> Content-Type: text/plain; charset="iso-8859-1"
> >>> ...
> >>> Here is how I reason about it, roughly:
> >>>
> >>> If the probability of losing a disk is 0.1%, the probability of
> >>> losing two disks simultaneously (i.e. before the failure can be
> >>> recovered) would be 0.1*0.1 = 0.01% and three disks becomes
> >>> 0.1*0.1*0.1 = 0.001% and four disks becomes 0.0001%
> >>
> >> I watched this conversation and an older similar one (Failure
> >> probability with largish deployments) with interest as we are in the
> >> process of planning a pretty large Ceph cluster (~3.5 PB), so I have
> >> been trying to wrap my head around these issues.
> >>
> >> Loic's reasoning (above) seems sound as a naive approximation assuming
> >> independent probabilities for disk failures, which may not be quite
> >> true given potential for batch production issues, but should be okay
> >> for other sorts of correlations (assuming a sane crushmap that
> >> eliminates things like controllers and nodes as sources of
> >> correlation).
> >>
> >> One of the things that came up in the "Failure probability with
> >> largish deployments" thread and has raised its head again here is the
> >> idea that striped data (e.g., RADOS-GW objects and RBD volumes) might
> >> be somehow more prone to data-loss than non-striped. I don't think
> >> anyone has so far provided an answer on this, so here's my thinking...
> >>
> >> The level of atomicity that matters when looking at durability &
> >> availability in Ceph is the Placement Group. For any non-trivial RBD
> >> it is likely that many RBDs will span all/most PGs, e.g., even a
> >> relatively small 50GiB volume would (with the default 4MiB object size)
> >> consist of 12800 objects - more objects than there are PGs in many
> >> production clusters obeying the 100-200 PGs per drive rule of thumb, so
> >> it will touch virtually every PG. <IMPORTANT>Losing any
> >> one PG will cause data-loss. The failure-probability effects of
> >> striping across multiple PGs are immaterial considering that loss of
> >> any single PG is likely to damage all your RBDs</IMPORTANT>. This
> >> might be why the reliability calculator doesn't consider total number
> >> of disks.
> >>
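To put a rough number on how quickly a volume ends up touching nearly every
PG, a quick sketch (assuming objects map uniformly at random onto PGs; the
4800-PG pool size is taken from the example further down):

def pg_coverage(volume_gib, pgs, object_mib=4):
    # expected fraction of PGs holding at least one object of the volume
    objects = volume_gib * 1024 // object_mib
    return 1 - (1 - 1 / pgs) ** objects

print(pg_coverage(50, 4800))     # 50 GiB volume, 12800 objects -> ~0.93
print(pg_coverage(500, 4800))    # 500 GiB volume               -> ~1.0
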
> >> Related to all this is the durability of 2 versus 3 replicas (or e.g.
> >> M>=1 for Erasure Coding). It's easy to get caught up in the worrying
> >> fallacy that losing any M OSDs will cause data-loss, but this isn't
> >> true - they have to be members of the same PG for data-loss to occur.
> >> So then it's tempting to think the chances of that happening are so
> >> slim as to not matter and why would we ever even need 3 replicas. I
> >> mean, what are the odds of exactly those 2 drives, out of the
> >> 100,200... in my cluster, failing in <recovery window>?! But therein
> >> lies the rub - you should be thinking about PGs. If a drive fails then
> >> the chance of a resulting data-loss event depends on the chances of
> >> losing further drives from the affected/degraded PGs.
> >>
> >> I've got a real cluster at hand, so let's use that as an example. We
> >> have 96 drives/OSDs - 8 nodes, 12 OSDs per node, 2 replicas, top-down
> >> failure domains: rack pairs (x2), nodes, OSDs... Let's say OSD 15
> >> dies. How many PGs are now at risk:
> >> $ grep "^10\." pg.dump | awk '{print $15}' | grep 15 | wc
> >>     109     109     861
> >> (NB: 10 is the pool id, pg.dump is a text file dump of "ceph pg dump",
> >> $15 is the acting set column)
> >>
> >> 109 PGs now "living on the edge". No surprises in that number as we
> >> used 100 * 96 / 2 = 4800 to arrive at the PG count for this pool, so
> >> on average any one OSD will be primary for 50 PGs and replica for
> >> another 50. But this doesn't tell me how exposed I am, for that I need
> >> to know how many "neighbouring" OSDs there are in these 109 PGs:
> >> $ grep "^10\." pg.dump | awk '{print $15}' | grep 15 | sed
> >> 's/\[15,\(.*\)\]/\1/' | sed 's/\[\(.*\),15\]/\1/' | sort | uniq | wc
> >>      67      67     193
> >> (NB: grep-ing for OSD "15" and using sed to remove it and surrounding
> >> formatting to get just the neighbour id)
> >>
> >> Yikes! So if any one of those 67 drives fails during recovery of OSD
> >> 15, then we've lost data. On average we should expect this to be
> >> determined by our crushmap, which in this case splits the cluster up
> >> into 2 top-level failure domains, so I'd guess it's the probability of
> >> 1 in 48 drives failing on average for this cluster. But actually
> >> looking at the numbers for each OSD it is higher than that here - the
> >> lowest distinct "neighbour" count we have is 50. Note that we haven't
> >> tuned any of the options in our crushmap, so I guess maybe Ceph
> >> favours fewer repeat sets by default when coming up with PGs(?).
> >>
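To express that exposure as a rough probability: with 2 replicas, an OSD
failure turns into data loss if any of its PG neighbours also dies within
the recovery window. A sketch (independent failures assumed, using the 8%
AFR and 24 hour window discussed above):

def p_loss_after_osd_failure(neighbours, p_disk_window):
    # chance that at least one of the surviving neighbours fails too
    return 1 - (1 - p_disk_window) ** neighbours

p_window = 0.08 / (24 * 365) * 24     # assumed per-disk failure prob in 24h
print(p_loss_after_osd_failure(67, p_window))   # OSD 15's 67 neighbours -> ~1.5%

Small per event, but it adds up across 96 drives and a few years.
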
> >> Anyway, here's the average and top 10 neighbour counts (hope this
> >> scripting is right! ;-):
> >>
> >> $ for OSD in {0..95}; do echo -ne "$OSD\t"; grep "^10\." pg.dump | awk
> >> '{print $15}' | grep "\[${OSD},\|,${OSD}\]" | sed
> >> "s/\[$OSD,\(.*\)\]/\1/" | sed "s/\[\(.*\),$OSD\]/\1/" | sort | uniq |
> >> wc -l; done | awk '{ total += $2 } END { print total/NR }'
> >> 58.5208
> >>
> >> $ for OSD in {0..95}; do echo -ne "$OSD\t"; grep "^10\." pg.dump | awk
> >> '{print $15}' | grep "\[${OSD},\|,${OSD}\]" | sed
> >> "s/\[$OSD,\(.*\)\]/\1/" | sed "s/\[\(.*\),$OSD\]/\1/" | sort | uniq |
> >> wc -l; done | sort -k2 -r | head
> >> 78 69
> >> 37 68
> >> 92 67
> >> 15 67
> >> 91 66
> >> 66 65
> >> 61 65
> >> 89 64
> >> 88 64
> >> 87 64
> >> (OSD# Neighbour#)
> >>
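The same neighbour counting can be done without the sed gymnastics; an
equivalent sketch in Python, making the same assumptions about the dump
format as the pipeline above (pool id 10, acting set in column 15):

from collections import defaultdict

neighbours = defaultdict(set)
with open("pg.dump") as f:                     # text dump of "ceph pg dump"
    for line in f:
        if not line.startswith("10."):         # pool id 10 only
            continue
        acting = line.split()[14].strip("[]")  # 15th column: acting set, e.g. [15,3]
        osds = [int(x) for x in acting.split(",")]
        for osd in osds:
            neighbours[osd].update(o for o in osds if o != osd)

counts = sorted(((len(v), k) for k, v in neighbours.items()), reverse=True)
print(sum(c for c, _ in counts) / len(counts))   # average neighbour count
print(counts[:10])                               # top 10 as (neighbours, osd)
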
> >> So, if I am getting this right then at the end of the day __I think__
> >> all this essentially boils down (sans CRUSH) to the number of possible
> >> combinations (not permutations - order is irrelevant) of OSDs that can
> >> be chosen. Making the utilised fraction of that space smaller is only
> >> possible by increasing r in nCr:
> >> 96 choose 2 = 4560
> >> 96 choose 3 = 142880
> >>
> >> So basically with two replicas, if _any_ two disks fail within your
> >> recovery window the chance of data-loss is high thanks to the chances
> >> of those OSDs intersecting in the concrete space of PGs represented in
> >> the pool. With three replicas that tapers off hugely as we're only
> >> utilising 4800 / 142880 * 100 ~= 3.5% of the potential PG space.
> >>
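The combinations argument in numbers, for anyone who wants to plug in their
own cluster and replica count (a quick sketch; math.comb needs Python 3.8+):

import math

osds, pgs = 96, 4800
for r in (2, 3):
    combos = math.comb(osds, r)      # number of distinct r-OSD sets
    print(r, combos, pgs / combos)
# r=2: 4560 pairs,     4800/4560   ~= 1.05  (nearly every pair is in use)
# r=3: 142880 triples, 4800/142880 ~= 0.034 (~3.4% of the space)
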
> >> I guess to some extent that this holds true for M values in EC pools.
> >>
> >> I hope some of this makes sense...? I'd love to see some of these
> >> questions answered canonically by Inktank or Sage, if not then perhaps
> >> I'll see how far I get sticking this diatribe into the ICE support
> >> portal...
> >>
> > 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

