> Message: 25
> Date: Fri, 15 Aug 2014 15:06:49 +0200
> From: Loic Dachary <loic at dachary.org>
> To: Erik Logtenberg <erik at logtenberg.eu>, ceph-users at lists.ceph.com
> Subject: Re: [ceph-users] Best practice K/M-parameters EC pool
> Message-ID: <53EE05E9.1040105 at dachary.org>
> Content-Type: text/plain; charset="iso-8859-1"
> ...
> Here is how I reason about it, roughly:
>
> If the probability of losing a disk is 0.1%, the probability of losing two
> disks simultaneously (i.e. before the failure can be recovered) would be
> 0.1*0.1 = 0.01% and three disks becomes 0.1*0.1*0.1 = 0.001% and four disks
> becomes 0.0001%

I watched this conversation and an older similar one (Failure probability with largish deployments) with interest, as we are in the process of planning a pretty large Ceph cluster (~3.5 PB), so I have been trying to wrap my head around these issues.

Loic's reasoning (above) seems sound as a naive approximation assuming independent disk-failure probabilities. That may not be quite true given the potential for batch production issues, but it should be okay for other sorts of correlation (assuming a sane crushmap that eliminates things like controllers and nodes as sources of correlation).

One of the things that came up in the "Failure probability with largish deployments" thread, and has raised its head again here, is the idea that striped data (e.g., RADOS-GW objects and RBD volumes) might somehow be more prone to data-loss than non-striped. I don't think anyone has so far provided an answer on this, so here's my thinking...

The level of atomicity that matters when looking at durability & availability in Ceph is the Placement Group. For any non-trivial RBD workload it is likely that many RBDs will span all/most PGs - e.g., even a relatively small 50GiB volume would (with the default 4MiB object size) be striped over 12800 objects, i.e. up to 12800 distinct PGs - more than there are in many production clusters obeying the 100-200 PGs per drive rule of thumb. Losing any one PG will cause data-loss, and the failure-probability effects of striping across multiple PGs are immaterial considering that loss of any single PG is likely to damage all your RBDs. This might be why the reliability calculator doesn't consider the total number of disks.

Related to all this is the durability of 2 versus 3 replicas (or, e.g., M>=1 for Erasure Coding). It's easy to get caught up in the worrying fallacy that losing any M OSDs will cause data-loss, but this isn't true - they have to be members of the same PG for data-loss to occur. So then it's tempting to think the chances of that happening are so slim as to not matter, and to wonder why we would ever even need 3 replicas. I mean, what are the odds of exactly those 2 drives, out of the 100, 200... in my cluster, failing within <recovery window>?! But therein lies the rub - you should be thinking about PGs. If a drive fails, then the chance of a resulting data-loss event depends on the chance of losing further drives from the affected/degraded PGs.

I've got a real cluster at hand, so let's use that as an example. We have 96 drives/OSDs - 8 nodes, 12 OSDs per node, 2 replicas, top-down failure domains: rack pairs (x2), nodes, OSDs... Let's say OSD 15 dies. How many PGs are now at risk:

$ grep "^10\." pg.dump | awk '{print $15}' | grep 15 | wc
    109     109     861

(NB: 10 is the pool id, pg.dump is a text file dump of "ceph pg dump", $15 is the acting set column)

109 PGs now "living on the edge".
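In case anyone wants to repeat this on their own cluster without the grep/sed gymnastics, here's a rough Python sketch of the same idea - count the PGs whose acting set contains a given OSD and collect the distinct "neighbour" OSDs. It makes the same assumptions as my one-liners (a plain-text "ceph pg dump" saved as pg.dump, pool id 10, acting set in the 15th column, printed like [15,42]); column positions can shift between Ceph versions, so treat it as illustrative rather than gospel.

#!/usr/bin/env python
# Rough sketch: count at-risk PGs and distinct neighbour OSDs for a failed OSD.
# Assumptions (same as the one-liners in this mail): pg.dump is a plain-text
# "ceph pg dump", the pool id is 10, and the acting set is the 15th
# whitespace-separated column, formatted like "[15,42]". Check your own
# pg dump layout before trusting the output.

POOL_ID = "10"       # pool we care about
FAILED_OSD = 15      # the OSD we are pretending just died
ACTING_COL = 14      # 0-based index of the acting-set column ($15 in awk)

at_risk = 0
neighbours = set()

with open("pg.dump") as f:
    for line in f:
        if not line.startswith(POOL_ID + "."):
            continue
        fields = line.split()
        if len(fields) <= ACTING_COL:
            continue
        acting = [int(x) for x in fields[ACTING_COL].strip("[]").split(",") if x]
        if FAILED_OSD not in acting:
            continue
        at_risk += 1
        neighbours.update(o for o in acting if o != FAILED_OSD)

print("PGs containing osd.%d: %d" % (FAILED_OSD, at_risk))
print("distinct neighbour OSDs: %d" % len(neighbours))

Unlike the sed approach it doesn't assume exactly two OSDs per acting set, so it should also work unchanged for 3 replicas or an EC pool.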
No surprises in that 109 figure, as we used 100 * 96 / 2 = 4800 to arrive at the PG count for this pool, so on average any one OSD will be primary for 50 PGs and replica for another 50. But this doesn't tell me how exposed I am; for that I need to know how many "neighbouring" OSDs appear in these 109 PGs:

$ grep "^10\." pg.dump | awk '{print $15}' | grep 15 | sed 's/\[15,\(.*\)\]/\1/' | sed 's/\[\(.*\),15\]/\1/' | sort | uniq | wc
     67      67     193

(NB: grep-ing for OSD "15" and using sed to remove it and the surrounding formatting, to get just the neighbour id)

Yikes! So if any one of those 67 drives fails during recovery of OSD 15, then we've lost data.

What we should expect here is determined by our crushmap, which in this case splits the cluster up into 2 top-level failure domains, so I'd guess a given OSD's partners should on average be drawn from the 48 drives in the other domain - i.e., data-loss would need one of roughly 1 in 48 drives to fail during recovery. But actually, looking at the numbers for each OSD, the exposure is higher than that here - the lowest distinct "neighbour" count we have is 50. Note that we haven't tuned any of the options in our crushmap, so I guess maybe Ceph favours fewer repeat sets by default when coming up with PGs(?). Anyway, here are the average and top 10 neighbour counts (hope this scripting is right! ;-):

$ for OSD in {0..95}; do echo -ne "$OSD\t"; grep "^10\." pg.dump | awk '{print $15}' | grep "\[${OSD},\|,${OSD}\]" | sed "s/\[$OSD,\(.*\)\]/\1/" | sed "s/\[\(.*\),$OSD\]/\1/" | sort | uniq | wc -l; done | awk '{ total += $2 } END { print total/NR }'
58.5208

$ for OSD in {0..95}; do echo -ne "$OSD\t"; grep "^10\." pg.dump | awk '{print $15}' | grep "\[${OSD},\|,${OSD}\]" | sed "s/\[$OSD,\(.*\)\]/\1/" | sed "s/\[\(.*\),$OSD\]/\1/" | sort | uniq | wc -l; done | sort -k2 -r | head
78      69
37      68
92      67
15      67
91      66
66      65
61      65
89      64
88      64
87      64

(OSD#   Neighbour#)

So, if I am getting this right, then at the end of the day __I think__ all this essentially boils down (sans CRUSH) to the number of possible combinations (not permutations - order is irrelevant) of OSDs that can be chosen as an acting set. Making the odds of a fatal overlap smaller is only possible by increasing r in nCr:

96 choose 2 = 4560
96 choose 3 = 142880

So basically with two replicas, if _any_ two disks fail within your recovery window, the chance of data-loss is high, thanks to the chance of those OSDs intersecting in the concrete space of PGs represented in the pool. With three replicas that tapers off hugely, as we're only utilising 4800 / 142880 * 100 ~= 3.4% of the potential PG space. I guess that to some extent this holds true for M values in EC pools too.

I hope some of this makes sense...? I'd love to see some of these questions answered canonically by Inktank or Sage; if not, then perhaps I'll see how far I get sticking this diatribe into the ICE support portal...

--
Cheers,
~Blairo
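P.S. In case anyone wants to plug their own numbers into the above, here's a quick Python sketch of the back-of-envelope maths. Everything in it is an assumption to be replaced: the per-recovery-window failure probability is just a placeholder in the spirit of Loic's 0.1%, the neighbour count is what we measured on our cluster, and it keeps the same naive independent-failure model, so take it as illustrative only.

#!/usr/bin/env python
# Back-of-envelope sketch of the reasoning above. All constants are
# assumptions to be replaced with your own: cluster size, PG count,
# per-drive failure probability within one recovery window, and the
# observed neighbour count. Drive failures are assumed independent
# (per Loic's approximation); batch/controller correlations are ignored.

from math import factorial

def ncr(n, r):
    """Number of combinations (order irrelevant) of r OSDs out of n."""
    return factorial(n) // (factorial(r) * factorial(n - r))

OSDS = 96                 # drives in the cluster
PGS = 4800                # PGs in the pool (100 * 96 / 2)
P_FAIL_IN_WINDOW = 0.001  # assumed chance one drive dies within a recovery window
NEIGHBOURS = 58           # average distinct neighbours per OSD (measured above)

for replicas in (2, 3):
    combos = ncr(OSDS, replicas)
    print("%d replicas: %d possible acting-set combinations; the pool's %d PGs "
          "cover %.1f%% of that space"
          % (replicas, combos, PGS, 100.0 * PGS / combos))

# Given one OSD has already failed, rough chance that at least one of its
# "neighbour" drives also fails before recovery completes (2-replica case):
p_second = 1 - (1 - P_FAIL_IN_WINDOW) ** NEIGHBOURS
print("P(second failure among %d neighbours during recovery) ~ %.3f%%"
      % (NEIGHBOURS, 100 * p_second))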