Yeah, I know, but I believe it was fixed so that a single copy is sufficient for recovery now (even with min_size=1)?
Depends on what you want to achieve... The point is that even if we lost “just” 1% of data, that’s too much (>0%) when talking about customer data, and I know from experience that some volumes are unavailable when I lose 3 OSDs - and I don’t have that many volumes...

Jan

> On 10 Jun 2015, at 10:40, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
> I'm not a mathematician, but I'm pretty sure there are 200 choose 3 =
> 1.3 million ways you can have 3 disks fail out of 200. nPGs = 16384, so
> that many combinations would cause data loss. So I think 1.2% of
> triple disk failures would lead to data loss. There might be another
> factor of 3! that needs to be applied to nPGs -- I'm currently
> thinking about that.
> But you're right, if indeed you do ever lose an entire PG, _every_ RBD
> device will have random holes in its data, like Swiss cheese.
>
> BTW PGs can have stuck IOs without losing all three replicas -- see min_size.
>
> Cheers, Dan
>
> On Wed, Jun 10, 2015 at 10:20 AM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>> When you increase the number of OSDs, you generally would (and should) increase the number of PGs. For us, the sweet spot for ~200 OSDs is 16384 PGs.
>> An RBD volume that has xxx GiB of data gets striped across many PGs, so the probability that the volume loses at least part of its data is very significant.
>> Someone correct me if I’m wrong, but I _know_ (from sad experience) that with the current CRUSH map, if 3 disks fail in 3 different hosts, lots of instances (maybe all of them) have their IO stuck until 3 copies of the data are restored.
>>
>> I just tested that by hand:
>> a 150GB volume will consist of ~150000/4 = 37500 objects.
>> When I list their locations with “ceph osd map”, every time I get a different PG, and a random mix of OSDs that host the PG.
>>
>> Thus, it is very likely that this volume will be lost when I lose any 3 OSDs, as at least one of the PGs will be hosted on all of them. What this probability is I don’t know - (I’m not good at statistics, is it combinations?) - but generally the data I care most about is stored in a multi-terabyte volume, and even if the probability of failure were 0.1%, that’s several orders of magnitude too high for me to be comfortable.
>>
>> I’d like nothing more than for someone to tell me I’m wrong :-)
>>
>> Jan
>>
>>> On 10 Jun 2015, at 09:55, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>>
>>> This is a CRUSH misconception. Triple drive failures only cause data
>>> loss when they share a PG (e.g. ceph pg dump .. those [x,y,z] triples
>>> of OSDs are the only ones that matter). If you have very few OSDs,
>>> then it's possibly true that any combination of disks would lead to
>>> failure. But as you increase the number of OSDs, the likelihood of a
>>> triple sharing a PG decreases (even though the number of 3-way
>>> combinations increases).
>>>
>>> Cheers, Dan
>>>
>>> On Wed, Jun 10, 2015 at 8:47 AM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>>>> A hidden danger in the default CRUSH rules is that if you lose 3 drives in 3 different hosts at the same time, you _will_ lose data, and not just some data but possibly a piece of every RBD volume you have...
>>>> And the probability of that happening is sadly nowhere near zero. We had drives drop out of the cluster under load, which of course happens right when a drive fails, then another fails, then another fails… not pretty.
>>>>
>>>> Jan
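
To put a rough number on Dan’s 200-choose-3 estimate above, here is a quick back-of-the-envelope script. It is only a sketch: it assumes 200 OSDs, 16384 PGs, 3 replicas, and treats each PG’s acting set as an independent random 3-OSD combination, which real CRUSH placement is not.

#!/usr/bin/env python
# Back-of-the-envelope estimate only. Assumptions: 200 OSDs, 16384 PGs,
# replica 3, and each PG's acting set treated as an independent random
# 3-OSD combination (real CRUSH placement is not truly independent).
n_osds = 200
n_pgs = 16384

# Number of distinct 3-OSD failure combinations: C(200, 3) = 1,313,400
triples = n_osds * (n_osds - 1) * (n_osds - 2) // 6

# A given PG's acting set matches the failed triple with probability
# 1/triples; the chance that at least one of the n_pgs PGs is hit:
p_loss = 1.0 - (1.0 - 1.0 / triples) ** n_pgs

print("3-OSD combinations: %d" % triples)
print("P(some PG lost | 3 simultaneous OSD failures): %.2f%%" % (100 * p_loss))

This prints roughly 1.2%, in line with Dan’s figure; and because both the failed set and a PG’s acting set are unordered, no extra 3! factor should be needed.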
>>>>
>>>>> On 09 Jun 2015, at 18:11, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
>>>>>
>>>>> If you are using the default rule set (which I think has min_size 2),
>>>>> you can sustain 1-4 disk failures or one host failure.
>>>>>
>>>>> The reason disk failures vary so wildly is that you can lose all the
>>>>> disks in a host.
>>>>>
>>>>> You can lose up to another 4 disks (in the same host) or 1 host
>>>>> without data loss, but I/O will block until Ceph can replicate at
>>>>> least one more copy (assuming the min_size 2 stated above).
>>>>> ----------------
>>>>> Robert LeBlanc
>>>>> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>>>>>
>>>>>
>>>>> On Tue, Jun 9, 2015 at 9:53 AM, kevin parrikar wrote:
>>>>>> I have a 4 node cluster, each with 5 disks (4 OSD and 1 operating system, also
>>>>>> hosting 3 monitor processes), with default replica 3.
>>>>>>
>>>>>> Total OSD disks : 16
>>>>>> Total Nodes : 4
>>>>>>
>>>>>> How can I calculate the
>>>>>>
>>>>>> Maximum number of disk failures my cluster can handle without any impact
>>>>>> on current data and new writes.
>>>>>> Maximum number of node failures my cluster can handle without any impact
>>>>>> on current data and new writes.
>>>>>>
>>>>>> Thanks for any help
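
For the 4-node, 4-OSDs-per-node cluster Kevin asks about, the worst-case/best-case arithmetic behind Robert’s answer can be sketched as below. It assumes a replicated pool with size=3, min_size=2 and a host-level CRUSH failure domain - check the actual rule set and pool settings before relying on it.

# Minimal sketch, assuming a replicated pool with size=3, min_size=2 and a
# host-level CRUSH failure domain (each PG keeps its replicas on 3 different
# hosts). OSDs-per-host taken from Kevin's 4x4 layout.
size, min_size = 3, 2
osds_per_host = 4

# A PG holds at most one replica per host, so for any single PG only the
# number of *hosts* touched by the failures matters.
worst_io_ok   = size - min_size                     # any 1 disk, anywhere
best_io_ok    = (size - min_size) * osds_per_host   # failures confined to 1 host
worst_no_loss = size - 1                            # any 2 disks, anywhere
best_no_loss  = (size - 1) * osds_per_host          # failures confined to 2 hosts

print("disk failures with I/O unaffected: %d (worst case) to %d (best case)"
      % (worst_io_ok, best_io_ok))
print("disk failures without data loss:   %d (worst case) to %d (best case)"
      % (worst_no_loss, best_no_loss))
print("host failures with I/O unaffected: %d" % (size - min_size))
print("host failures without data loss:   %d" % (size - 1))

The best cases only hold while the failed disks stay confined to the already-degraded hosts; spread the same number of failures across three hosts and you are back in the scenario Jan describes above.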