Re: calculating maximum number of disk and node failures that can be handled by cluster without data loss


 



I'm not a mathematician, but I'm pretty sure there are 200 choose 3 =
1.3 million ways you can have 3 disks fail out of 200. nPGs = 16384, so
at most that many of those combinations would cause data loss. So I think
~1.2% of triple disk failures would lead to data loss. There might be
another factor of 3! that needs to be applied to nPGs -- I'm currently
thinking about that.
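
If you want to sanity-check that arithmetic, here's a quick Python
sketch (using the numbers from this thread -- 200 OSDs, 16384 PGs,
replica size 3 -- and ignoring CRUSH's host-separation constraint):

# Rough estimate of the data-loss risk for a simultaneous triple-disk failure.
# Assumptions (numbers from this thread): 200 OSDs, 16384 PGs, replica size 3.
from math import comb

n_osds = 200
n_pgs = 16384

triples = comb(n_osds, 3)   # all unordered 3-disk failure combinations: 1313400
# Each PG's acting set is one unordered triple of OSDs, so at most n_pgs of
# the possible triples are fatal (fewer if several PGs share the same triple).
# As far as I can tell no extra 3! factor is needed, because both the failure
# combinations and the acting sets are counted as unordered triples here.
p_loss = n_pgs / triples

print(f"3-disk combinations: {triples}")
print(f"fraction causing data loss <= {p_loss:.3%}")   # ~1.247%
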
But you're right, if indeed you do ever lose an entire PG, _every_ RBD
device will have random holes in its data, like swiss cheese.
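
To put a rough number on that "swiss cheese" effect, here's a
back-of-the-envelope sketch. It assumes objects hash roughly uniformly
over the PGs and reuses the ~37500-object / 16384-PG figures from Jan's
150 GB example below; it suggests a single volume of that size already
touches roughly 90% of all PGs, so losing any one PG is very likely to
hit every large volume.

# Expected number of distinct PGs touched by one RBD volume.
# Assumption: object-to-PG placement behaves like uniform random hashing.
n_pgs = 16384
n_objects = 37500   # ~150 GB volume at the default 4 MB object size

# Probability that a given PG holds none of this volume's objects:
p_empty = (1 - 1 / n_pgs) ** n_objects
expected_pgs_touched = n_pgs * (1 - p_empty)

print(f"expected PGs touched: {expected_pgs_touched:.0f} of {n_pgs}")      # ~14700
print(f"chance a specific PG is untouched by this volume: {p_empty:.1%}")  # ~10%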

BTW PGs can have stuck IOs without losing all three replicas -- see min_size.
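
To illustrate what that means for a single PG, here is a tiny sketch of
the behaviour with the common size=3 / min_size=2 settings (an
illustration only, not Ceph code):

# Illustration only -- not Ceph code.  IO behaviour of one PG as its
# replicas fail, assuming size=3 and min_size=2.
def pg_state(surviving_replicas: int, size: int = 3, min_size: int = 2) -> str:
    if surviving_replicas == 0:
        return "data lost (all replicas gone)"
    if surviving_replicas < min_size:
        return "IO blocked until recovery restores min_size copies"
    if surviving_replicas < size:
        return "IO continues, PG degraded while it re-replicates"
    return "healthy"

for alive in (3, 2, 1, 0):
    print(alive, "replica(s) up ->", pg_state(alive))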

Cheers, Dan

On Wed, Jun 10, 2015 at 10:20 AM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
> When you increase the number of OSDs, you generally would (and should) increase the number of PGs. For us, the sweet spot for ~200 OSDs is 16384 PGs.
> An RBD volume that has xxx GiBs of data gets striped across many PGs, so the probability that the volume loses at least part of its data is very significant.
> Someone correct me if I’m wrong, but I _know_ (from sad experience) that with the current CRUSH map, if 3 disks fail in 3 different hosts, lots of instances (maybe all of them) have their IO stuck until 3 copies of the data are restored.
>
> I just tested that by hand:
> a 150GB volume will consist of ~150000/4 = 37500 objects (with the default 4 MB object size)
> When I list their locations with “ceph osd map”, every time I get a different PG, and a random mix of OSDs that host the PG.
>
> Thus, it is very likely that this volume will be lost when I lose any 3 OSDs, as at least one of the PGs will be hosted on all of them. What this probability is I don’t know - (I’m not good at statistics, is it combinations?) - but generally the data I care most about is stored in a multi-terabyte volume, and even if the probability of failure was 0.1%, that’s several orders of magnitude too high for me to be comfortable.
>
> I’d like nothing more than for someone to tell me I’m wrong :-)
>
> Jan
>
>> On 10 Jun 2015, at 09:55, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>
>> This is a CRUSH misconception. Triple drive failures only cause data
>> loss when they share a PG (e.g. ceph pg dump .. those [x,y,z] triples
>> of OSDs are the only ones that matter). If you have very few OSDs,
>> then it's possibly true that any combination of disks would lead to
>> failure. But as you increase the number of OSDs, the likelihood of any
>> given triple sharing a PG decreases (even though the number of 3-way
>> combinations increases).
>>
>> Cheers, Dan
>>
>> On Wed, Jun 10, 2015 at 8:47 AM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>>> A hidden danger in the default CRUSH rules is that if you lose 3 drives in 3 different hosts at the same time, you _will_ lose data, and not just some data but possibly a piece of every RBD volume you have...
>>> And the probability of that happening is sadly nowhere near zero. We had drives drop out of the cluster under load, and that load of course comes when a drive fails, then another fails, then another fails… not pretty.
>>>
>>> Jan
>>>
>>>> On 09 Jun 2015, at 18:11, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
>>>>
>>>> If you are using the default rule set (which I think has min_size 2),
>>>> you can sustain 1-4 disk failures or one host failure.
>>>>
>>>> The reason the tolerable number of disk failures varies so wildly is
>>>> that you can lose all of the disks in one host.
>>>>
>>>> You can lose up to another 4 disks (in the same host) or 1 host
>>>> without data loss, but I/O will block until Ceph can replicate at
>>>> least one more copy (assuming the min_size 2 stated above).
>>>> ----------------
>>>> Robert LeBlanc
>>>> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>
>>>>
>>>> On Tue, Jun 9, 2015 at 9:53 AM, kevin parrikar  wrote:
>>>>> I have a 4-node cluster, each node with 5 disks (4 OSD disks and 1 operating
>>>>> system disk; the cluster also hosts 3 monitor processes), with default replica 3.
>>>>>
>>>>> Total OSD disks : 16
>>>>> Total Nodes : 4
>>>>>
>>>>> How can I calculate the:
>>>>>
>>>>> Maximum number of disk failures my cluster can handle without any impact
>>>>> on current data and new writes.
>>>>> Maximum number of node failures my cluster can handle without any impact
>>>>> on current data and new writes.
>>>>>
>>>>> Thanks for any help
>>>>>
>>>>>
>>>>
>>>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




