Yeah, I know, but I believe it was fixed so that a single copy is sufficient for recovery now (even with min_size=1)?
Depends on what you want to achieve... The point is that even if we lost “just” 1% of data, that’s too much (>0%) when talking about customer data, and I know from experience that some volumes are unavailable when I lose 3 OSDs - and I don’t have that many volumes...

Jan

> On 10 Jun 2015, at 10:40, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
> I'm not a mathematician, but I'm pretty sure there are 200 choose 3 =
> 1.3 million ways you can have 3 disks fail out of 200. nPGs = 16384, so
> that many combinations would cause data loss. So I think 1.2% of
> triple disk failures would lead to data loss. There might be another
> factor of 3! that needs to be applied to nPGs -- I'm currently
> thinking about that.
> But you're right, if indeed you do ever lose an entire PG, _every_ RBD
> device will have random holes in its data, like Swiss cheese.
>
> BTW PGs can have stuck IOs without losing all three replicas -- see min_size.
>
> Cheers, Dan
>
> On Wed, Jun 10, 2015 at 10:20 AM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>> When you increase the number of OSDs, you generally would (and should) increase the number of PGs. For us, the sweet spot for ~200 OSDs is 16384 PGs.
>> An RBD volume that has xxx GiB of data gets striped across many PGs, so the probability that the volume loses at least part of its data is very significant.
>> Someone correct me if I’m wrong, but I _know_ (from sad experience) that with the current CRUSH map, if 3 disks fail in 3 different hosts, lots of instances (maybe all of them) have their IO stuck until 3 copies of the data are restored.
>>
>> I just tested that by hand:
>> a 150GB volume will consist of ~150000/4 = 37500 objects.
>> When I list their locations with “ceph osd map”, every time I get a different PG, and a random mix of OSDs that host the PG.
>>
>> Thus, it is very likely that this volume will be lost when I lose any 3 OSDs, as at least one of the PGs will be hosted on all of them. What this probability is I don’t know - (I’m not good at statistics, is it combinations?) - but generally the data I care most about is stored in a multi-terabyte volume, and even if the probability of failure were 0.1%, that’s several orders of magnitude too high for me to be comfortable.
>>
>> I’d like nothing more than for someone to tell me I’m wrong :-)
>>
>> Jan
>>
>>> On 10 Jun 2015, at 09:55, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>>
>>> This is a CRUSH misconception. Triple drive failures only cause data
>>> loss when they share a PG (e.g. ceph pg dump .. those [x,y,z] triples
>>> of OSDs are the only ones that matter). If you have very few OSDs,
>>> then it's possibly true that any combination of disks would lead to
>>> failure. But as you increase the number of OSDs, the likelihood of a
>>> triple sharing a PG decreases (even though the number of 3-way
>>> combinations increases).
>>>
>>> Cheers, Dan
>>>
>>> On Wed, Jun 10, 2015 at 8:47 AM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>>>> A hidden danger in the default CRUSH rules is that if you lose 3 drives in 3 different hosts at the same time, you _will_ lose data, and not just some data but possibly a piece of every RBD volume you have...
>>>> And the probability of that happening is sadly nowhere near zero. We had drives drop out of the cluster under load, which of course happens right when a drive fails, then another fails, then another fails… not pretty.
>>>>
>>>> Jan
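
To put a rough number on Dan’s 200-choose-3 estimate above, here is a quick back-of-the-envelope script. It is only a sketch: it assumes 200 OSDs, 16384 PGs, 3 replicas, and treats each PG’s acting set as an independent random 3-OSD combination, which real CRUSH placement is not.

#!/usr/bin/env python
# Back-of-the-envelope estimate only. Assumptions: 200 OSDs, 16384 PGs,
# replica 3, and each PG's acting set treated as an independent random
# 3-OSD combination (real CRUSH placement is not truly independent).
n_osds = 200
n_pgs = 16384

# Number of distinct 3-OSD failure combinations: C(200, 3) = 1,313,400
triples = n_osds * (n_osds - 1) * (n_osds - 2) // 6

# A given PG's acting set matches the failed triple with probability
# 1/triples; the chance that at least one of the n_pgs PGs is hit:
p_loss = 1.0 - (1.0 - 1.0 / triples) ** n_pgs

print("3-OSD combinations: %d" % triples)
print("P(some PG lost | 3 simultaneous OSD failures): %.2f%%" % (100 * p_loss))

This prints roughly 1.2%, in line with Dan’s figure; and because both the failed set and a PG’s acting set are unordered, no extra 3! factor should be needed.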
>>>>
>>>>> On 09 Jun 2015, at 18:11, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
>>>>>
>>>>> If you are using the default rule set (which I think has min_size 2),
>>>>> you can sustain 1-4 disk failures or one host failure.
>>>>>
>>>>> The reason disk failures vary so wildly is that you can lose all the
>>>>> disks in a host.
>>>>>
>>>>> You can lose up to another 4 disks (in the same host) or 1 host
>>>>> without data loss, but I/O will block until Ceph can replicate at
>>>>> least one more copy (assuming the min_size 2 stated above).
>>>>> ----------------
>>>>> Robert LeBlanc
>>>>> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>>>>>
>>>>>
>>>>> On Tue, Jun 9, 2015 at 9:53 AM, kevin parrikar wrote:
>>>>>> I have a 4 node cluster, each with 5 disks (4 OSD and 1 operating system, also
>>>>>> hosting 3 monitor processes), with default replica 3.
>>>>>>
>>>>>> Total OSD disks : 16
>>>>>> Total Nodes : 4
>>>>>>
>>>>>> How can I calculate the
>>>>>>
>>>>>> Maximum number of disk failures my cluster can handle without any impact
>>>>>> on current data and new writes.
>>>>>> Maximum number of node failures my cluster can handle without any impact
>>>>>> on current data and new writes.
>>>>>>
>>>>>> Thanks for any help
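
For the 4-node, 4-OSDs-per-node cluster Kevin asks about, the worst-case/best-case arithmetic behind Robert’s answer can be sketched as below. It assumes a replicated pool with size=3, min_size=2 and a host-level CRUSH failure domain - check the actual rule set and pool settings before relying on it.

# Minimal sketch, assuming a replicated pool with size=3, min_size=2 and a
# host-level CRUSH failure domain (each PG keeps its replicas on 3 different
# hosts). OSDs-per-host taken from Kevin's 4x4 layout.
size, min_size = 3, 2
osds_per_host = 4

# A PG holds at most one replica per host, so for any single PG only the
# number of *hosts* touched by the failures matters.
worst_io_ok   = size - min_size                     # any 1 disk, anywhere
best_io_ok    = (size - min_size) * osds_per_host   # failures confined to 1 host
worst_no_loss = size - 1                            # any 2 disks, anywhere
best_no_loss  = (size - 1) * osds_per_host          # failures confined to 2 hosts

print("disk failures with I/O unaffected: %d (worst case) to %d (best case)"
      % (worst_io_ok, best_io_ok))
print("disk failures without data loss:   %d (worst case) to %d (best case)"
      % (worst_no_loss, best_no_loss))
print("host failures with I/O unaffected: %d" % (size - min_size))
print("host failures without data loss:   %d" % (size - 1))

The best cases only hold while the failed disks stay confined to the already-degraded hosts; spread the same number of failures across three hosts and you are back in the scenario Jan describes above.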