Re: calculating maximum number of disk and node failures that can be handled by cluster without data loss


 



When you increase the number of OSDs, you generally would (and should) increase the number of PGs. For us, the sweet spot for ~200 OSDs is 16384 PGs.
An RBD volume that has xxx GiB of data gets striped across many PGs, so the probability that the volume loses at least part of its data is very significant.
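
(For the curious, here is the arithmetic behind that sweet spot as a rough Python sketch. It shows how many PG copies each OSD carries with our numbers and what the usual ~100-PG-copies-per-OSD rule of thumb would suggest; the 200 / 16384 / replica-3 figures are ours, so plug in your own.)

  import math

  osds = 200
  replicas = 3
  pg_num = 16384

  # PG copies each OSD carries with these numbers
  print(pg_num * replicas / osds)             # ~246 PG copies per OSD

  # the usual rule of thumb: ~100 PG copies per OSD, rounded up to a power of two
  target = osds * 100 / replicas
  print(2 ** math.ceil(math.log2(target)))    # 8192

So our 16384 is one power of two above the textbook value.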
Someone correct me if I’m wrong, but I _know_ (from sad experience) that with the current CRUSH map, if 3 disks fail in 3 different hosts, lots of instances (maybe all of them) have their I/O stuck until 3 copies of the data are restored.

I just tested this by hand: a 150 GB volume will consist of ~150,000/4 = 37,500 objects (with the default 4 MB object size).
When I list their locations with “ceph osd map”, every time I get a different PG, and a random mix of OSDs hosting that PG.
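
To put a number on that without mapping all 37,500 objects by hand, here is a quick back-of-the-envelope in Python. It assumes objects hash uniformly into the pool's PGs and uses our pool's pg_num of 16384; substitute your own values.

  pg_num = 16384           # PGs in the pool (ours; use yours)
  objects = 150_000 // 4   # ~37,500 objects for a 150 GB volume with 4 MB objects

  # expected number of distinct PGs hit by that many objects
  # (the classic "balls into bins" expectation)
  expected_pgs = pg_num * (1 - (1 - 1 / pg_num) ** objects)
  print(round(expected_pgs), "of", pg_num, "PGs")   # roughly 14,700 of 16,384

In other words, a volume that size touches the vast majority of the PGs in the pool.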

Thus, it is very likely that this volume will be lost when I lose any 3 OSDs, as at least one of the PGs will be hosted on all of them. What this probability actually is I don’t know (I’m not good at statistics; is it combinations?), but generally the data I care most about is stored in a multi-terabyte volume, and even if the probability of failure were 0.1%, that’s several orders of magnitude too high for me to be comfortable.
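
Since I’m asking about the statistics anyway, here is my naive stab at it in Python. It treats every PG’s acting set as an independent, uniformly random 3-OSD combination, which CRUSH placement is not (host buckets restrict which triples are possible, and the sets are not independent), so treat it as an order-of-magnitude sketch only; the inputs are the numbers from above.

  from math import comb

  osds = 200         # OSDs in the cluster
  pgs_used = 14700   # distinct PGs the volume touches (estimate from above)

  # Probability that one particular triple of simultaneously failed OSDs
  # is the acting set of at least one PG used by the volume, if acting
  # sets were uniform random 3-subsets of the OSDs.
  triples = comb(osds, 3)             # 1,313,400 possible triples
  p_hit = 1 - (1 - 1 / triples) ** pgs_used
  print(f"{p_hit:.1%}")               # about 1.1%

So with these numbers a given simultaneous 3-OSD failure has on the order of a 1% chance of taking a bite out of this volume. That fraction shrinks as the cluster grows, because the number of possible triples grows much faster than the PG count, but it is nowhere near zero either, which is exactly what worries me.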

I’d like nothing more than for someone to tell me I’m wrong :-)

Jan

> On 10 Jun 2015, at 09:55, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> 
> This is a CRUSH misconception. Triple drive failures only cause data
> loss when they share a PG (e.g. in "ceph pg dump", those [x,y,z] triples
> of OSDs are the only ones that matter). If you have very few OSDs,
> then it's possibly true that any combination of disks would lead to
> failure. But as you increase the number of OSDs, the likelihood of a
> triple sharing a PG decreases (even though the number of 3-way
> combinations increases).
> 
> Cheers, Dan
> 
> On Wed, Jun 10, 2015 at 8:47 AM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>> A hidden danger in the default CRUSH rules is that if you lose 3 drives in 3 different hosts at the same time, you _will_ lose data, and not just some data but possibly a piece of every RBD volume you have...
>> And the probability of that happening is sadly nowhere near zero. We had drives drop out of the cluster under load, which of course is exactly what happens when a drive fails, then another fails, then another fails… not pretty.
>> 
>> Jan
>> 
>>> On 09 Jun 2015, at 18:11, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
>>> 
>>> If you are using the default rule set (which I think has min_size 2),
>>> you can sustain 1-4 disk failures or one host failure.
>>> 
>>> The reason the disk failure count varies so widely is that you can lose
>>> all the disks in one host.
>>> 
>>> You can lose up to another 4 disks (in the same host) or 1 host
>>> without data loss, but I/O will block until Ceph can replicate at
>>> least one more copy (assuming the min_size 2 stated above).
>>> ----------------
>>> Robert LeBlanc
>>> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>> 
>>> 
>>> On Tue, Jun 9, 2015 at 9:53 AM, kevin parrikar  wrote:
>>>> I have a 4-node cluster, each node with 5 disks (4 OSD disks and 1 operating-system
>>>> disk; the cluster also hosts 3 monitor processes), with the default replica count of 3.
>>>> 
>>>> Total OSD disks : 16
>>>> Total Nodes : 4
>>>> 
>>>> How can I calculate the following?
>>>> 
>>>> The maximum number of disk failures my cluster can handle without any impact
>>>> on current data and new writes.
>>>> The maximum number of node failures my cluster can handle without any impact
>>>> on current data and new writes.
>>>> 
>>>> Thanks for any help
>>>> 
>>> 
>> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




