Re: calculating maximum number of disk and node failures that can be handled by cluster without data loss

Hello,

On Wed, 10 Jun 2015 23:53:48 +0300 Vasiliy Angapov wrote:

> Hi,
> 
> I also wrote a simple script which calculates the data loss probabilities
> for triple disk failure. Here are some numbers:
> OSDs: 10,   Pr: 138.89%
> OSDs: 20,   Pr: 29.24%
> OSDs: 30,   Pr: 12.32%
> OSDs: 40,   Pr: 6.75%
> OSDs: 50,   Pr: 4.25%
> OSDs: 100, Pr: 1.03%
> OSDs: 200, Pr: 0.25%
> OSDs: 500, Pr: 0.04%
> 
Nice, good to have some numbers.

> Here I assumed we have 100 PGs per OSD. There is also a constraint that
> the 3 disks must not all be in one host, because that case does not lead
> to data loss. For a situation where all disks are evenly distributed
> between 10 hosts, it gives us a correction coefficient of 83%, so for 50
> OSDs it will be something like 3.53% instead of 4.25%.
> 
> There is a further constraint for 2 disks in one host and 1 disk on
> another, but that just adds unneeded complexity; the numbers would not
> change significantly.
> And a triple simultaneous failure is itself not very likely to happen,
> so I believe that starting from 100 OSDs we can relax somewhat about
> data loss.
> 
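His script wasn't posted, but a minimal sketch of that style of
calculation, assuming 100 PGs per OSD, replica 3, and PG placement
approximated as independent uniformly random OSD triples, could look
like the following. Note it computes a true probability,
1 - (1 - 1/C(n,3))^nPGs, so it stays below 100% and won't reproduce the
figures above exactly; a value like 138.89% is better read as an
expected number of lost PGs than as a probability.

    from math import comb  # Python 3.8+

    PGS_PER_OSD = 100
    REPLICAS = 3

    def loss_chance(n_osds):
        n_pgs = n_osds * PGS_PER_OSD // REPLICAS  # each PG spans 3 OSDs
        triples = comb(n_osds, 3)                 # possible 3-disk failures
        # Probability that at least one PG lands on exactly the failed
        # triple, treating the n_pgs placements as independent draws.
        return 1.0 - (1.0 - 1.0 / triples) ** n_pgs

    for n in (10, 20, 30, 40, 50, 100, 200, 500):
        print("OSDs: %3d,  Pr: %6.2f%%" % (n, 100 * loss_chance(n)))
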
I mentioned the link below before; I found it to be one of the more
believable RAID failure calculators, and they explain their shortcomings
nicely to boot.
I usually halve their DLO/year values (doubling the chance of data loss)
to be on the safe side: https://www.memset.com/tools/raid-calculator/

If you plunk in a 100-disk RAID6 (the equivalent of replica 3), 2TB per
disk, and a recovery rate of 100MB/s, the odds are indeed pretty good.
But note the expected disk failure rate of one per 10 days!

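The rebuild time behind those numbers is simple arithmetic: at 100MB/s,
re-replicating one 2TB disk takes about 2,000,000MB / 100MB/s = 20,000s,
i.e. roughly 5.5 hours of exposure to further failures, in that model,
for every disk that dies.
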
Of course the biggest variable here is how fast your recovery speed
will be. I picked 100MB/s because for some people that will be as fast
as their network goes. For others the network could be 10-40 times as
fast, but their cluster might not have enough OSDs (or fast enough ones)
to remain usable at those speeds, so they'll opt for lower-priority
recovery speeds.

Christian

> BTW, this presentation has more math
> http://www.slideshare.net/kioecn/build-an-highperformance-and-highdurable-block-storage-service-based-on-ceph
> 
> Regards, Vasily.
> 
> On Wed, Jun 10, 2015 at 12:38 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx>
> wrote:
> 
> > OK I wrote a quick script to simulate triple failures and count how
> > many would have caused data loss. The script gets your list of OSDs
> > and PGs, then simulates failures and checks if any permutation of that
> > failure matches a PG.
> >
> > Here's an example with 10000 simulations on our production cluster:
> >
> > # ./simulate-failures.py
> > We have 1232 OSDs and 21056 PGs, hence 21056 combinations e.g. like
> > this: (945, 910, 399)
> > Simulating 10000 failures
> > Simulated 1000 triple failures. Data loss incidents = 0
> > Data loss incident with failure (676, 451, 931)
> > Simulated 2000 triple failures. Data loss incidents = 1
> > Simulated 3000 triple failures. Data loss incidents = 1
> > Simulated 4000 triple failures. Data loss incidents = 1
> > Simulated 5000 triple failures. Data loss incidents = 1
> > Simulated 6000 triple failures. Data loss incidents = 1
> > Simulated 7000 triple failures. Data loss incidents = 1
> > Simulated 8000 triple failures. Data loss incidents = 1
> > Data loss incident with failure (1031, 1034, 806)
> > Data loss incident with failure (449, 644, 329)
> > Simulated 9000 triple failures. Data loss incidents = 3
> > Simulated 10000 triple failures. Data loss incidents = 3
> >
> > End of simulation: Out of 10000 triple failures, 3 caused a data loss
> > incident
> >
> >
> > The script is here:
> >
> > https://github.com/cernceph/ceph-scripts/blob/master/tools/durability/simulate-failures.py
> > Give it a try (on your test clusters!)
> >
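For reference, the core of that simulation fits in a few lines. A
condensed sketch (the real script linked above reads the actual
PG-to-OSD mapping from the cluster; here it is generated at random
purely for illustration):

    import random

    N_OSDS, N_PGS, TRIALS = 1232, 21056, 10000

    # Stand-in for parsing "ceph pg dump": each PG becomes an
    # unordered triple of distinct OSD ids.
    pgs = {frozenset(random.sample(range(N_OSDS), 3))
           for _ in range(N_PGS)}

    losses = 0
    for _ in range(TRIALS):
        failed = frozenset(random.sample(range(N_OSDS), 3))
        if failed in pgs:  # all 3 replicas of some PG are gone
            losses += 1
    print("Out of %d triple failures, %d caused data loss"
          % (TRIALS, losses))
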
> > Cheers, Dan
> >
> >
> >
> >
> >
> > On Wed, Jun 10, 2015 at 10:47 AM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
> > > Yeah, I know, but I believe it was fixed so that a single copy is
> > > sufficient for recovery now (even with min_size=1)? It depends on
> > > what you want to achieve...
> > >
> > > The point is that even if we lost “just” 1% of data, that’s too much
> > > (>0%) when talking about customer data, and I know from experience
> > > that some volumes are unavailable when I lose 3 OSDs - and I don’t
> > > have that many volumes...
> > >
> > > Jan
> > >
> > >> On 10 Jun 2015, at 10:40, Dan van der Ster <dan@xxxxxxxxxxxxxx>
> > >> wrote:
> > >>
> > >> I'm not a mathematician, but I'm pretty sure there are 200 choose 3
> > >> = 1.3 million ways you can have 3 disks fail out of 200. nPGs =
> > >> 16384 so that many combinations would cause data loss. So I think
> > >> 1.2% of triple disk failures would lead to data loss. There might
> > >> be another factor of 3! that needs to be applied to nPGs -- I'm
> > >> currently thinking about that.
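
(That arithmetic checks out:

    from math import comb  # Python 3.8+
    print(comb(200, 3))           # 1313400 possible 3-disk failures
    print(16384 / comb(200, 3))   # ~0.0125, i.e. about 1.2%

assuming each of the 16384 PGs occupies a distinct OSD triple.)
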
> > >> But you're right, if indeed you do ever lose an entire PG, _every_
> > >> RBD device will have random holes in its data, like Swiss cheese.
> > >>
> > >> BTW PGs can have stuck IOs without losing all three replicas --
> > >> see min_size.
> > >>
> > >> Cheers, Dan
> > >>
> > >> On Wed, Jun 10, 2015 at 10:20 AM, Jan Schermer <jan@xxxxxxxxxxx>
> > >> wrote:
> > >>> When you increase the number of OSDs, you generally would (and
> > >>> should) increase the number of PGs. For us, the sweet spot for
> > >>> ~200 OSDs is 16384 PGs.
> > >>> An RBD volume that has xxx GiBs of data gets striped across many
> > >>> PGs, so the probability that the volume loses at least part of its
> > >>> data is very significant.
> > >>> Someone correct me if I’m wrong, but I _know_ (from sad experience)
> > >>> that with the current CRUSH map, if 3 disks fail in 3 different
> > >>> hosts, lots of instances (maybe all of them) have their IO stuck
> > >>> until 3 copies of the data are restored.
> > >>>
> > >>> I just tested that by hand:
> > >>> a 150GB volume will consist of ~150,000MB / 4MB = 37,500 objects.
> > >>> When I list their locations with “ceph osd map”, every time I get
> > >>> a different PG, and a random mix of OSDs that host the PG.
> > >>>
> > >>> Thus, it is very likely that this volume will be lost when I lose
> > >>> any 3 OSDs, as at least one of its PGs will be hosted on all of
> > >>> them. What this probability is I don’t know (I’m not good at
> > >>> statistics - is it combinations?), but generally the data I care
> > >>> most about is stored in a multi-terabyte volume, and even if the
> > >>> probability of failure was 0.1%, that’s several orders of
> > >>> magnitude too high for me to be comfortable.
> > >>>
> > >>> I’d like nothing more than for someone to tell me I’m wrong :-)
> > >>>
> > >>> Jan
> > >>>
> > >>>> On 10 Jun 2015, at 09:55, Dan van der Ster <dan@xxxxxxxxxxxxxx>
> > >>>> wrote:
> > >>>>
> > >>>> This is a CRUSH misconception. Triple drive failures only cause
> > >>>> data loss when they share a PG (e.g. ceph pg dump .. those
> > >>>> [x,y,z] triples of OSDs are the only ones that matter). If you
> > >>>> have very few OSDs, then it's possibly true that any combination
> > >>>> of disks would lead to failure. But as you increase the number of
> > >>>> OSDs, the likelihood of a triple sharing a PG decreases (even
> > >>>> though the number of 3-way combinations increases).
> > >>>>
> > >>>> Cheers, Dan
> > >>>>
> > >>>> On Wed, Jun 10, 2015 at 8:47 AM, Jan Schermer <jan@xxxxxxxxxxx>
> > >>>> wrote:
> > >>>>> A hidden danger in the default CRUSH rules is that if you lose
> > >>>>> 3 drives in 3 different hosts at the same time, you _will_ lose
> > >>>>> data, and not just some data but possibly a piece of every RBD
> > >>>>> volume you have...
> > >>>>> And the probability of that happening is sadly nowhere near
> > >>>>> zero. We had drives drop out of the cluster under load, which of
> > >>>>> course comes when a drive fails, then another fails, then
> > >>>>> another fails… not pretty.
> > >>>>>
> > >>>>> Jan
> > >>>>>
> > >>>>>> On 09 Jun 2015, at 18:11, Robert LeBlanc <robert@xxxxxxxxxxxxx>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>> If you are using the default rule set (which I think has
> > >>>>>> min_size 2), you can sustain 1-4 disk failures or one host
> > >>>>>> failure.
> > >>>>>>
> > >>>>>> The reason disk failures vary so wildly is that you can lose
> > >>>>>> all the disks in one host.
> > >>>>>>
> > >>>>>> You can lose up to another 4 disks (in the same host) or 1 host
> > >>>>>> without data loss, but I/O will block until Ceph can replicate
> > >>>>>> at least one more copy (assuming the min_size 2 stated above).
> > >>>>>> ----------------
> > >>>>>> Robert LeBlanc
> > >>>>>> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62
> > >>>>>> B9F1
> > >>>>>>
> > >>>>>>
> > >>>>>> On Tue, Jun 9, 2015 at 9:53 AM, kevin parrikar  wrote:
> > >>>>>>> I have a 4-node cluster, each node with 5 disks (4 OSDs and
> > >>>>>>> 1 operating system disk; the cluster also hosts 3 monitor
> > >>>>>>> processes), with the default replica 3.
> > >>>>>>>
> > >>>>>>> Total OSD disks: 16
> > >>>>>>> Total nodes: 4
> > >>>>>>>
> > >>>>>>> How can I calculate:
> > >>>>>>>
> > >>>>>>> the maximum number of disk failures my cluster can handle
> > >>>>>>> without any impact on current data and new writes, and
> > >>>>>>> the maximum number of node failures my cluster can handle
> > >>>>>>> without any impact on current data and new writes?
> > >>>>>>>
> > >>>>>>> Thanks for any help
> > >>>>>>>


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
