Hello,

On Tue, 26 Aug 2014 10:23:43 +1000 Blair Bethwaite wrote:

> > Message: 25
> > Date: Fri, 15 Aug 2014 15:06:49 +0200
> > From: Loic Dachary <loic at dachary.org>
> > To: Erik Logtenberg <erik at logtenberg.eu>, ceph-users at lists.ceph.com
> > Subject: Re: [ceph-users] Best practice K/M-parameters EC pool
> > Message-ID: <53EE05E9.1040105 at dachary.org>
> > Content-Type: text/plain; charset="iso-8859-1"
> > ...
> > Here is how I reason about it, roughly:
> >
> > If the probability of losing a disk is 0.1%, the probability of losing two disks simultaneously (i.e. before the failure can be recovered) would be 0.1*0.1 = 0.01% and three disks becomes 0.1*0.1*0.1 = 0.001% and four disks becomes 0.0001%
>
> I watched this conversation and an older similar one (Failure probability with largish deployments) with interest as we are in the process of planning a pretty large Ceph cluster (~3.5 PB), so I have been trying to wrap my head around these issues.
>

As the OP of the "Failure probability with largish deployments" thread I have to thank Blair for raising this issue again and doing the hard math below, which looks fine to me.

At the end of that slightly inconclusive thread I walked away with the same impression as Blair, namely that the survival of PGs is the key factor and that they will likely be spread out over most, if not all, of the OSDs.

That in turn reinforced my decision to deploy our first production Ceph cluster based on nodes with 2 OSDs, each backed by an 11-disk RAID6 set behind a HW RAID controller with 4GB cache AND SSD journals. I can live with the reduced performance (which is caused by the OSD code running out of steam long before the SSDs or the RAIDs do), because not only do I save 1/3rd of the space and 1/4th of the cost compared to a replication-3 cluster, the number of disks that need to fail within the recovery window to cause data loss is now 4.

The next cluster I'm currently building is a classic Ceph design: replication of 3, 8 OSD HDDs and 4 journal SSDs per node, because with this cluster I won't have predictable I/O patterns and loads. OTOH, I don't see it growing much beyond 48 OSDs, so I'm happy enough with the odds here.

I think doing the exact maths for a cluster of the size you're planning would be very interesting and also very much needed. 3.5PB of usable space would be close to 3000 disks with a replication of 3, but even if you meant that as a gross value it would probably mean that you're looking at frequent, if not daily, disk failures.
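To put some very rough numbers on that, here is a quick back-of-the-envelope sketch in Python. The drive size and the annual failure rate (AFR) below are assumptions of mine for illustration, not figures from this thread:

  # Rough failure frequency for a ~3.5 PB usable, replication-3 cluster.
  # ASSUMPTIONS (mine, purely illustrative): 4 TB drives, ~4% AFR per drive.
  usable_pb   = 3.5     # usable capacity being planned
  replication = 3
  drive_tb    = 4.0     # assumed drive size
  afr         = 0.04    # assumed annual failure rate per drive

  raw_tb   = usable_pb * 1000 * replication
  drives   = raw_tb / drive_tb
  per_year = drives * afr
  print("%d drives, ~%.0f failures/year, i.e. one every ~%.1f days"
        % (drives, per_year, 365.0 / per_year))

With those assumptions that comes out to roughly 2600 drives and a failure every three to four days; with smaller drives or a less optimistic AFR you are indeed knocking on the door of daily failures.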
Regards,

Christian

> Loic's reasoning (above) seems sound as a naive approximation assuming independent probabilities for disk failures, which may not be quite true given the potential for batch production issues, but should be okay for other sorts of correlations (assuming a sane crushmap that eliminates things like controllers and nodes as sources of correlation).
>
> One of the things that came up in the "Failure probability with largish deployments" thread and has raised its head again here is the idea that striped data (e.g., RADOS-GW objects and RBD volumes) might be somehow more prone to data-loss than non-striped. I don't think anyone has so far provided an answer on this, so here's my thinking...
>
> The level of atomicity that matters when looking at durability & availability in Ceph is the Placement Group. For any non-trivial RBD deployment it is likely that most RBDs will span all or most PGs; e.g., even a relatively small 50GiB volume would (with the default 4MiB object size) be striped across 12800 objects - more objects than there are PGs in many production clusters obeying the 100-200 PGs per drive rule of thumb. <IMPORTANT>Losing any one PG will cause data-loss. The failure-probability effects of striping across multiple PGs are immaterial considering that loss of any single PG is likely to damage all your RBDs</IMPORTANT>. This might be why the reliability calculator doesn't consider the total number of disks.
>
> Related to all this is the durability of 2 versus 3 replicas (or e.g. M>=1 for Erasure Coding). It's easy to get caught up in the worrying fallacy that losing any M OSDs will cause data-loss, but this isn't true - they have to be members of the same PG for data-loss to occur. So then it's tempting to think the chances of that happening are so slim as to not matter, and why would we ever even need 3 replicas? I mean, what are the odds of exactly those 2 drives, out of the 100, 200, ... in my cluster, failing in <recovery window>?! But therein lies the rub - you should be thinking about PGs. If a drive fails, then the chance of a data-loss event resulting is dependent on the chances of losing further drives from the affected/degraded PGs.
>
> I've got a real cluster at hand, so let's use that as an example. We have 96 drives/OSDs - 8 nodes, 12 OSDs per node, 2 replicas, top-down failure domains: rack pairs (x2), nodes, OSDs... Let's say OSD 15 dies. How many PGs are now at risk:
>
> $ grep "^10\." pg.dump | awk '{print $15}' | grep 15 | wc
> 109 109 861
>
> (NB: 10 is the pool id, pg.dump is a text file dump of "ceph pg dump", $15 is the acting set column)
>
> 109 PGs now "living on the edge". No surprises in that number, as we used 100 * 96 / 2 = 4800 to arrive at the PG count for this pool, so on average any one OSD will be primary for 50 PGs and replica for another 50. But this doesn't tell me how exposed I am; for that I need to know how many "neighbouring" OSDs there are in these 109 PGs:
>
> $ grep "^10\." pg.dump | awk '{print $15}' | grep 15 | sed 's/\[15,\(.*\)\]/\1/' | sed 's/\[\(.*\),15\]/\1/' | sort | uniq | wc
> 67 67 193
>
> (NB: grep-ing for OSD "15" and using sed to remove it and the surrounding formatting to get just the neighbour id)
>
> Yikes! So if any one of those 67 drives fails during recovery of OSD 15, then we've lost data. On average we should expect this to be determined by our crushmap, which in this case splits the cluster up into 2 top-level failure domains, so I'd guess it's the probability of any 1 in 48 drives failing on average for this cluster. But actually looking at the numbers for each OSD it is higher than that here - the lowest distinct "neighbour" count we have is 50. Note that we haven't tuned any of the options in our crushmap, so I guess maybe Ceph favours fewer repeat sets by default when coming up with PGs(?).
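To put a very rough number on that "Yikes": here's a quick Python sketch of the chance that one of those neighbours dies while the dead OSD is being re-replicated. The AFR and the recovery window are assumptions of mine, purely for illustration, and failures are treated as independent:

  # Chance that at least one of N "neighbour" OSDs fails while a dead OSD
  # is being recovered. ASSUMPTIONS (mine): independent failures, ~4% AFR,
  # a 24-hour recovery window, N = 67 neighbours as in the example above.
  afr        = 0.04
  window_h   = 24.0
  neighbours = 67

  p_one = afr * window_h / (365 * 24)       # P(a given drive dies within the window)
  p_any = 1 - (1 - p_one) ** neighbours     # P(at least one neighbour dies)
  print("P(second failure during recovery) ~= %.2f%%" % (p_any * 100))

That works out to roughly 0.7% per incident with these numbers, which sounds harmless until you multiply it by a disk failure every few days over the life of a large cluster - which is exactly why 2 replicas stop looking comfortable.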
> Anyway, here's the average and top 10 neighbour counts (hope this scripting is right! ;-):
>
> $ for OSD in {0..95}; do echo -ne "$OSD\t"; grep "^10\." pg.dump | awk '{print $15}' | grep "\[${OSD},\|,${OSD}\]" | sed "s/\[$OSD,\(.*\)\]/\1/" | sed "s/\[\(.*\),$OSD\]/\1/" | sort | uniq | wc -l; done | awk '{ total += $2 } END { print total/NR }'
> 58.5208
>
> $ for OSD in {0..95}; do echo -ne "$OSD\t"; grep "^10\." pg.dump | awk '{print $15}' | grep "\[${OSD},\|,${OSD}\]" | sed "s/\[$OSD,\(.*\)\]/\1/" | sed "s/\[\(.*\),$OSD\]/\1/" | sort | uniq | wc -l; done | sort -k2 -r | head
> 78 69
> 37 68
> 92 67
> 15 67
> 91 66
> 66 65
> 61 65
> 89 64
> 88 64
> 87 64
> (OSD# Neighbour#)
>
> So, if I am getting this right, then at the end of the day __I think__ all this essentially boils down (sans CRUSH) to the number of possible combinations (not permutations - order is irrelevant) of OSDs that can be chosen. Shrinking the fraction of that combination space actually occupied by PGs is only possible by increasing r in nCr:
> 96 choose 2 = 4560
> 96 choose 3 = 142880
>
> So basically with two replicas, if _any_ two disks fail within your recovery window the chance of data-loss is high, thanks to the chances of those OSDs intersecting in the concrete space of PGs represented in the pool. With three replicas that tapers off hugely, as we're only utilising 4800 / 142880 * 100 ~= 3.4% of the potential combination space.
>
> I guess to some extent this holds true for M values in EC pools.
>
> I hope some of this makes sense...? I'd love to see some of these questions answered canonically by Inktank or Sage; if not, then perhaps I'll see how far I get sticking this diatribe into the ICE support portal...

-- 
Christian Balzer        Network/Systems Engineer
chibi at gol.com         Global OnLine Japan/Fusion Communications
http://www.gol.com/