> Message: 25
> Date: Fri, 15 Aug 2014 15:06:49 +0200
> From: Loic Dachary <loic at dachary.org>
> To: Erik Logtenberg <erik at logtenberg.eu>, ceph-users at lists.ceph.com
> Subject: Re: [ceph-users] Best practice K/M-parameters EC pool
> Message-ID: <53EE05E9.1040105 at dachary.org>
> Content-Type: text/plain; charset="iso-8859-1"
> ...
> Here is how I reason about it, roughly:
>
> If the probability of losing a disk is 0.1%, the probability of losing two
> disks simultaneously (i.e. before the failure can be recovered) would be
> 0.1*0.1 = 0.01% and three disks becomes 0.1*0.1*0.1 = 0.001% and four disks
> becomes 0.0001%

I watched this conversation and an older similar one (Failure probability with largish deployments) with interest, as we are in the process of planning a pretty large Ceph cluster (~3.5 PB), so I have been trying to wrap my head around these issues.

Loic's reasoning (above) seems sound as a naive approximation assuming independent disk-failure probabilities. That may not be quite true given the potential for batch production issues, but it should be okay for other sorts of correlation (assuming a sane crushmap that eliminates things like controllers and nodes as sources of correlation).

One of the things that came up in the "Failure probability with largish deployments" thread, and has raised its head again here, is the idea that striped data (e.g., RADOS-GW objects and RBD volumes) might somehow be more prone to data-loss than non-striped. I don't think anyone has so far provided an answer on this, so here's my thinking...

The level of atomicity that matters when looking at durability & availability in Ceph is the Placement Group. For any non-trivial RBD workload it is likely that many RBDs will span all/most PGs - e.g., even a relatively small 50GiB volume would (with the default 4MiB object size) be striped over 12800 objects, i.e. up to 12800 distinct PGs - more than there are in many production clusters obeying the 100-200 PGs per drive rule of thumb. Losing any one PG will cause data-loss, and the failure-probability effects of striping across multiple PGs are immaterial considering that loss of any single PG is likely to damage all your RBDs. This might be why the reliability calculator doesn't consider the total number of disks.

Related to all this is the durability of 2 versus 3 replicas (or, e.g., M>=1 for Erasure Coding). It's easy to get caught up in the worrying fallacy that losing any M OSDs will cause data-loss, but this isn't true - they have to be members of the same PG for data-loss to occur. So then it's tempting to think the chances of that happening are so slim as to not matter, and to wonder why we would ever even need 3 replicas. I mean, what are the odds of exactly those 2 drives, out of the 100, 200... in my cluster, failing within <recovery window>?! But therein lies the rub - you should be thinking about PGs. If a drive fails, then the chance of a resulting data-loss event depends on the chance of losing further drives from the affected/degraded PGs.

I've got a real cluster at hand, so let's use that as an example. We have 96 drives/OSDs - 8 nodes, 12 OSDs per node, 2 replicas, top-down failure domains: rack pairs (x2), nodes, OSDs... Let's say OSD 15 dies. How many PGs are now at risk:

$ grep "^10\." pg.dump | awk '{print $15}' | grep 15 | wc
    109     109     861

(NB: 10 is the pool id, pg.dump is a text file dump of "ceph pg dump", $15 is the acting set column)

109 PGs now "living on the edge".
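In case anyone wants to repeat this on their own cluster without the grep/sed gymnastics, here's a rough Python sketch of the same idea - count the PGs whose acting set contains a given OSD and collect the distinct "neighbour" OSDs. It makes the same assumptions as my one-liners (a plain-text "ceph pg dump" saved as pg.dump, pool id 10, acting set in the 15th column, printed like [15,42]); column positions can shift between Ceph versions, so treat it as illustrative rather than gospel.

#!/usr/bin/env python
# Rough sketch: count at-risk PGs and distinct neighbour OSDs for a failed OSD.
# Assumptions (same as the one-liners in this mail): pg.dump is a plain-text
# "ceph pg dump", the pool id is 10, and the acting set is the 15th
# whitespace-separated column, formatted like "[15,42]". Check your own
# pg dump layout before trusting the output.

POOL_ID = "10"       # pool we care about
FAILED_OSD = 15      # the OSD we are pretending just died
ACTING_COL = 14      # 0-based index of the acting-set column ($15 in awk)

at_risk = 0
neighbours = set()

with open("pg.dump") as f:
    for line in f:
        if not line.startswith(POOL_ID + "."):
            continue
        fields = line.split()
        if len(fields) <= ACTING_COL:
            continue
        acting = [int(x) for x in fields[ACTING_COL].strip("[]").split(",") if x]
        if FAILED_OSD not in acting:
            continue
        at_risk += 1
        neighbours.update(o for o in acting if o != FAILED_OSD)

print("PGs containing osd.%d: %d" % (FAILED_OSD, at_risk))
print("distinct neighbour OSDs: %d" % len(neighbours))

Unlike the sed approach it doesn't assume exactly two OSDs per acting set, so it should also work unchanged for 3 replicas or an EC pool.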
No surprises in that 109 figure, as we used 100 * 96 / 2 = 4800 to arrive at the PG count for this pool, so on average any one OSD will be primary for 50 PGs and replica for another 50. But this doesn't tell me how exposed I am; for that I need to know how many "neighbouring" OSDs appear in these 109 PGs:

$ grep "^10\." pg.dump | awk '{print $15}' | grep 15 | sed 's/\[15,\(.*\)\]/\1/' | sed 's/\[\(.*\),15\]/\1/' | sort | uniq | wc
     67      67     193

(NB: grep-ing for OSD "15" and using sed to remove it and the surrounding formatting, to get just the neighbour id)

Yikes! So if any one of those 67 drives fails during recovery of OSD 15, then we've lost data.

What we should expect here is determined by our crushmap, which in this case splits the cluster up into 2 top-level failure domains, so I'd guess a given OSD's partners should on average be drawn from the 48 drives in the other domain - i.e., data-loss would need one of roughly 1 in 48 drives to fail during recovery. But actually, looking at the numbers for each OSD, the exposure is higher than that here - the lowest distinct "neighbour" count we have is 50. Note that we haven't tuned any of the options in our crushmap, so I guess maybe Ceph favours fewer repeat sets by default when coming up with PGs(?). Anyway, here are the average and top 10 neighbour counts (hope this scripting is right! ;-):

$ for OSD in {0..95}; do echo -ne "$OSD\t"; grep "^10\." pg.dump | awk '{print $15}' | grep "\[${OSD},\|,${OSD}\]" | sed "s/\[$OSD,\(.*\)\]/\1/" | sed "s/\[\(.*\),$OSD\]/\1/" | sort | uniq | wc -l; done | awk '{ total += $2 } END { print total/NR }'
58.5208

$ for OSD in {0..95}; do echo -ne "$OSD\t"; grep "^10\." pg.dump | awk '{print $15}' | grep "\[${OSD},\|,${OSD}\]" | sed "s/\[$OSD,\(.*\)\]/\1/" | sed "s/\[\(.*\),$OSD\]/\1/" | sort | uniq | wc -l; done | sort -k2 -r | head
78      69
37      68
92      67
15      67
91      66
66      65
61      65
89      64
88      64
87      64

(OSD#   Neighbour#)

So, if I am getting this right, then at the end of the day __I think__ all this essentially boils down (sans CRUSH) to the number of possible combinations (not permutations - order is irrelevant) of OSDs that can be chosen as an acting set. Making the odds of a fatal overlap smaller is only possible by increasing r in nCr:

96 choose 2 = 4560
96 choose 3 = 142880

So basically with two replicas, if _any_ two disks fail within your recovery window, the chance of data-loss is high, thanks to the chance of those OSDs intersecting in the concrete space of PGs represented in the pool. With three replicas that tapers off hugely, as we're only utilising 4800 / 142880 * 100 ~= 3.4% of the potential PG space. I guess that to some extent this holds true for M values in EC pools too.

I hope some of this makes sense...? I'd love to see some of these questions answered canonically by Inktank or Sage; if not, then perhaps I'll see how far I get sticking this diatribe into the ICE support portal...

--
Cheers,
~Blairo
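P.S. In case anyone wants to plug their own numbers into the above, here's a quick Python sketch of the back-of-envelope maths. Everything in it is an assumption to be replaced: the per-recovery-window failure probability is just a placeholder in the spirit of Loic's 0.1%, the neighbour count is what we measured on our cluster, and it keeps the same naive independent-failure model, so take it as illustrative only.

#!/usr/bin/env python
# Back-of-envelope sketch of the reasoning above. All constants are
# assumptions to be replaced with your own: cluster size, PG count,
# per-drive failure probability within one recovery window, and the
# observed neighbour count. Drive failures are assumed independent
# (per Loic's approximation); batch/controller correlations are ignored.

from math import factorial

def ncr(n, r):
    """Number of combinations (order irrelevant) of r OSDs out of n."""
    return factorial(n) // (factorial(r) * factorial(n - r))

OSDS = 96                 # drives in the cluster
PGS = 4800                # PGs in the pool (100 * 96 / 2)
P_FAIL_IN_WINDOW = 0.001  # assumed chance one drive dies within a recovery window
NEIGHBOURS = 58           # average distinct neighbours per OSD (measured above)

for replicas in (2, 3):
    combos = ncr(OSDS, replicas)
    print("%d replicas: %d possible acting-set combinations; the pool's %d PGs "
          "cover %.1f%% of that space"
          % (replicas, combos, PGS, 100.0 * PGS / combos))

# Given one OSD has already failed, rough chance that at least one of its
# "neighbour" drives also fails before recovery completes (2-replica case):
p_second = 1 - (1 - P_FAIL_IN_WINDOW) ** NEIGHBOURS
print("P(second failure among %d neighbours during recovery) ~ %.3f%%"
      % (NEIGHBOURS, 100 * p_second))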