Re: calculating the maximum number of disk and node failures that can be handled by a cluster without data loss

Hi,

I also wrote a simple script which calculates the data loss probabilities for triple disk failure. Here are some numbers:
OSDs: 10,   Pr: 138.89%
OSDs: 20,   Pr: 29.24%
OSDs: 30,   Pr: 12.32%
OSDs: 40,   Pr: 6.75%
OSDs: 50,   Pr: 4.25%
OSDs: 100,  Pr: 1.03%
OSDs: 200,  Pr: 0.25%
OSDs: 500,  Pr: 0.04%

Here I assumed 100 PGs per OSD. (The 10-OSD figure exceeds 100% because the estimate is really an expected number of destroyed PGs rather than a true probability; at that size a triple failure is essentially certain to lose data.) I also excluded the combinations where all 3 failed disks sit in the same host, since with a per-host CRUSH rule that case does not lose data. With the disks evenly distributed across 10 hosts this gives a correction coefficient of about 83%, so for 50 OSDs the figure becomes roughly 3.53% instead of 4.25%.

There is a further correction for the case of 2 failed disks in one host and 1 in another, but that just adds unneeded complexity; the numbers would not change significantly.
And a simultaneous triple failure is itself not very likely to happen, so I believe that starting from about 100 OSDs we can relax somewhat about data loss.
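
For reference, a simplified sketch of this kind of estimate is below. It is an approximation in the same spirit as the numbers above, not necessarily the exact formula of the full script: it assumes replica 3, uniformly random PG placement, and takes Pr as the expected number of destroyed PGs, nPGs / C(N, 3), capped at 100%. Depending on how you count the 100 PGs per OSD the absolute values shift by a constant factor, but the scaling with the number of OSDs is the same, and the host-level correction mentioned above can be layered on top:

#!/usr/bin/env python3
# Rough estimate: probability that a random triple disk failure destroys
# at least one PG, approximated by the expected number of lost PGs.
# Assumes replica size 3 and uniformly random PG placement.
from math import comb

def p_triple_loss(n_osds, pgs_per_osd=100, size=3):
    n_pgs = n_osds * pgs_per_osd / size          # total PGs implied
    return min(1.0, n_pgs / comb(n_osds, 3))     # cap at certainty

for n in (10, 20, 30, 40, 50, 100, 200, 500):
    print("OSDs: %4d,  Pr: %.2f%%" % (n, 100 * p_triple_loss(n)))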

BTW, this presentation has more of the math: http://www.slideshare.net/kioecn/build-an-highperformance-and-highdurable-block-storage-service-based-on-ceph

Regards, Vasily.

On Wed, Jun 10, 2015 at 12:38 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
OK I wrote a quick script to simulate triple failures and count how
many would have caused data loss. The script gets your list of OSDs
and PGs, then simulates failures and checks if any permutation of that
failure matches a PG.

Here's an example with 10000 simulations on our production cluster:

# ./simulate-failures.py
We have 1232 OSDs and 21056 PGs, hence 21056 combinations e.g. like
this: (945, 910, 399)
Simulating 10000 failures
Simulated 1000 triple failures. Data loss incidents = 0
Data loss incident with failure (676, 451, 931)
Simulated 2000 triple failures. Data loss incidents = 1
Simulated 3000 triple failures. Data loss incidents = 1
Simulated 4000 triple failures. Data loss incidents = 1
Simulated 5000 triple failures. Data loss incidents = 1
Simulated 6000 triple failures. Data loss incidents = 1
Simulated 7000 triple failures. Data loss incidents = 1
Simulated 8000 triple failures. Data loss incidents = 1
Data loss incident with failure (1031, 1034, 806)
Data loss incident with failure (449, 644, 329)
Simulated 9000 triple failures. Data loss incidents = 3
Simulated 10000 triple failures. Data loss incidents = 3

End of simulation: Out of 10000 triple failures, 3 caused a data loss incident


The script is here:
https://github.com/cernceph/ceph-scripts/blob/master/tools/durability/simulate-failures.py
Give it a try (on your test clusters!)
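
For reference, a stripped-down sketch of the same Monte Carlo idea is below. It is not the linked script (which reads the real OSD and PG lists from the cluster); here the PG placements are synthetic and uniformly random, so the acting sets won't match any real CRUSH map:

#!/usr/bin/env python3
# Sketch of the simulation: build a synthetic PG map (each PG on 3
# distinct random OSDs), then draw random triple failures and count
# how many wipe out all three replicas of some PG.
import random

N_OSDS, N_PGS, TRIALS = 1232, 21056, 10000

pgs = set()
while len(pgs) < N_PGS:
    pgs.add(frozenset(random.sample(range(N_OSDS), 3)))

losses = 0
for _ in range(TRIALS):
    failed = frozenset(random.sample(range(N_OSDS), 3))
    if failed in pgs:   # all three replicas of some PG are gone
        losses += 1

print("Out of %d triple failures, %d caused a data loss incident"
      % (TRIALS, losses))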

Cheers, Dan





On Wed, Jun 10, 2015 at 10:47 AM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
> Yeah, I know, but I believe it was fixed so that a single copy is sufficient for recovery now (even with min_size=1)? It depends on what you want to achieve...
>
> The point is that even if we lost “just” 1% of data, that’s too much (>0%) when talking about customer data, and I know from experience that some volumes are unavailable when I lose 3 OSDs -  and I don’t have that many volumes...
>
> Jan
>
>> On 10 Jun 2015, at 10:40, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>
>> I'm not a mathematician, but I'm pretty sure there are 200 choose 3 =
>> 1.3 million ways you can have 3 disks fail out of 200. nPGs = 16384 so
>> that many combinations would cause data loss. So I think 1.2% of
>> triple disk failures would lead to data loss. There might be another
>> factor of 3! that needs to be applied to nPGs -- I'm currently
>> thinking about that.
>> But you're right, if indeed you do ever lose an entire PG, _every_ RBD
>> device will have random holes in its data, like Swiss cheese.
>>
>> BTW PGs can have stuck IOs without losing all three replicas -- see min_size.
>>
>> Cheers, Dan
>>
>> On Wed, Jun 10, 2015 at 10:20 AM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>>> When you increase the number of OSDs, you generally would (and should) increase the number of PGs. For us, the sweet spot for ~200 OSDs is 16384 PGs.
>>> An RBD volume that has xxx GiB of data gets striped across many PGs, so the probability that the volume loses at least part of its data is very significant.
>>> Someone correct me if I’m wrong, but I _know_ (from sad experience) that with the current CRUSH map if 3 disks fail in 3 different hosts, lots of instances (maybe all of them) have their IO stuck until 3 copies of data are restored.
>>>
>>> I just tested that by hand
>>> a 150 GB volume will consist of ~150000/4 = 37500 objects (at the default 4 MB per object)
>>> When I list their locations with “ceph osd map”, every time I get a different PG and a random mix of OSDs hosting that PG.
>>>
>>> Thus, it is very likely that this volume will be lost when I lose any 3 OSDs, as at least one of its PGs will be hosted on all three of them. What this probability is I don’t know (I’m not good at statistics - is it combinations?), but generally the data I care most about is stored in a multi-terabyte volume, and even if the probability of failure were 0.1%, that’s several orders of magnitude too high for me to be comfortable.
>>>
>>> I’d like nothing more than for someone to tell me I’m wrong :-)
>>>
>>> Jan
>>>
>>>> On 10 Jun 2015, at 09:55, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>>>
>>>> This is a CRUSH misconception. Triple drive failures only cause data
>>>> loss when they share a PG (e.g. in ceph pg dump, those [x,y,z] triples
>>>> of OSDs are the only ones that matter). If you have very few OSDs,
>>>> then it's possibly true that any combination of disks would lead to
>>>> failure. But as you increase the number of OSDs, the likelihood of a
>>>> triple sharing a PG decreases (even though the number of 3-way
>>>> combinations increases).
>>>>
>>>> Cheers, Dan
>>>>
>>>> On Wed, Jun 10, 2015 at 8:47 AM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>>>>> The hidden danger in the default CRUSH rules is that if you lose 3 drives in 3 different hosts at the same time, you _will_ lose data - and not just some data, but possibly a piece of every RBD volume you have...
>>>>> And the probability of that happening is sadly nowhere near zero. We had drives drop out of the cluster under load, which of course comes exactly when a drive fails - then another fails, then another fails… not pretty.
>>>>>
>>>>> Jan
>>>>>
>>>>>> On 09 Jun 2015, at 18:11, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>> If you are using the default rule set (which I think has min_size 2),
>>>>>> you can sustain 1-4 disk failures or one host failure.
>>>>>>
>>>>>> The reason the tolerable number of disk failures varies so widely is
>>>>>> that the failed disks may all be in one host.
>>>>>>
>>>>>> You can lose up to another 4 disks (in the same host) or 1 host
>>>>>> without data loss, but I/O will block until Ceph can replicate at
>>>>>> least one more copy (assuming the min_size 2 stated above).
>>>>>> ----------------
>>>>>> Robert LeBlanc
>>>>>> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 9, 2015 at 9:53 AM, kevin parrikar  wrote:
>>>>>>> I have a 4-node cluster, each node with 5 disks (4 OSDs and 1 operating
>>>>>>> system disk; the cluster also hosts 3 monitor processes), with the default replica count of 3.
>>>>>>>
>>>>>>> Total OSD disks : 16
>>>>>>> Total Nodes : 4
>>>>>>>
>>>>>>> How can I calculate:
>>>>>>>
>>>>>>> The maximum number of disk failures my cluster can handle without any impact
>>>>>>> on current data and new writes.
>>>>>>> The maximum number of node failures my cluster can handle without any impact
>>>>>>> on current data and new writes.
>>>>>>>
>>>>>>> Thanks for any help

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
