> On 28 Oct 2014, at 09:30, Christian Balzer <chibi@xxxxxxx> wrote:
>
> On Tue, 28 Oct 2014 07:46:30 +0000 Dan Van Der Ster wrote:
>
>>> On 28 Oct 2014, at 08:25, Robert van Leeuwen
>>> <Robert.vanLeeuwen@xxxxxxxxxxxxx> wrote:
>>>
>>>> By now we have decided to use a SuperMicro SKU with 72 bays: 22 SSDs
>>>> + 50 SATA drives per server.
>>>> Our racks can hold 10 of these servers, and 50 such racks in a Ceph
>>>> cluster = 36000 OSDs.
>>>> With 4 TB SATA drives, replica count = 2, and nearfull ratio = 0.8 we
>>>> get 40 petabytes of usable capacity.
>>>>
>>>> Is that too big, or a normal use case for Ceph?
>>>
>>> I'm a bit worried about the replica count:
>>> the chance of 2 disks out of 25000 failing at the same time becomes
>>> very significant (or a disk + server failure). Without doing any math,
>>> my gut feeling says that 3 replicas is still not very comfortable
>>> (especially if the disks come from the same batch).
>>
>> It doesn't quite work like that. You're not going to lose data if _any_
>> two disks out of 25000 fail. You'll only lose data if two disks that
>> are coupled in a PG are lost. So, while there are C(25000, 2) ways to
>> lose two disks, there are only nPGs disk pairs that matter for data
>> loss. Said another way: suppose one disk has already failed; what is
>> the probability of losing data? The data loss scenario happens only if
>> one of the ~100 disks coupled with the failed disk also fails. So you
>> see, the chance of data loss with 2 replicas is roughly equivalent
>> whether you have 1000 OSDs or 25000 OSDs.
>
> We keep having that discussion here and are still lacking a fully
> realistic model for this scenario. ^^
> Though I seem to recall work is being done on the Ceph reliability
> calculator.

Perhaps I've missed those discussions, but this principle has been in the
reliability calculator since forever; see "declustering". There is no place
to enter the number of OSDs, because it is just not relevant. Just run
ceph pg dump and look at those combinations of OSDs: those are the only
simultaneous failures that matter. Every other combination of failures
will not cause data loss. (See the first sketch at the end of this mail.)

You can even write a Monte Carlo simulation of this: generate random
pairs/triplets/etc. of OSDs and see if they would cause data loss (second
sketch below). The probability of an M-way failure causing data loss will
be roughly nPGs / C(nOSDs, M).

> Let's just say that with a replica count of 2 and a set of 100 disks,
> all the models and calculators I checked predict a data loss within a
> year.
> That data-loss probability goes down from 99.99% to just 0.04% per year
> (which I would still consider too high) with a replica count of 3.
> That's why I never use more than 22 HDDs in a RAID6 and keep this at
> 10-12 for anything mission critical.

Yeah, 2 replicas doesn't cut it; I agree on that. 3 is the minimum.
Actually, tolerance to 2 simultaneous failures is the minimum (which you
can also get with EC, for example). We first used 4 replicas in our RBD
pool, but after realizing that not all OSDs are coupled together, we
decreased to 3 replicas.

> And having likely multiple (even if unrelated) OSD failures at the same
> time can't be good for recovery times (increased risk) and cluster
> performance either.

Yes, I agree. I don't really like that we can only limit
max_backfills/recoveries per OSD. What we need is a cluster-wide limit:
I don't want more than 30-40 OSDs backfilling at once in my 1000-OSD RBD
cluster, otherwise the latency penalty gets annoying. (The per-OSD knobs
are shown in the last sketch below.)
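To make the "only the coupled pairs matter" point concrete, here's a rough
sketch of how you could pull the fatal OSD pairs out of a live cluster. It
assumes `ceph pg dump --format json` and the usual pg_stats/acting keys;
the exact JSON layout varies between Ceph releases, so treat the key
lookups as illustrative rather than authoritative:

```python
import json
import subprocess
from itertools import combinations

# Sketch: list the OSD pairs whose simultaneous loss would lose data in a
# replica-2 pool. Assumes the `ceph pg dump --format json` output carries
# a "pg_stats" list with an "acting" OSD list per PG; adjust the key
# lookups for your Ceph release.
raw = json.loads(subprocess.check_output(
    ["ceph", "pg", "dump", "--format", "json"]))
pg_stats = raw.get("pg_stats") or raw.get("pg_map", {}).get("pg_stats", [])

fatal_pairs = set()
for pg in pg_stats:
    # Every unordered pair of OSDs sharing a PG is a "fatal couple":
    # lose both before recovery completes and that PG's data is gone.
    fatal_pairs.update(combinations(sorted(pg["acting"]), 2))

print(f"{len(pg_stats)} PGs -> {len(fatal_pairs)} distinct fatal OSD pairs")
```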
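And a minimal Monte Carlo sketch of the M-way failure probability, using a
synthetic PG map (in practice you'd feed it the fatal sets from the dump
above). The cluster sizes here are made-up numbers for illustration:

```python
import random
from math import comb

# Minimal Monte Carlo sketch: probability that a random simultaneous
# M-way OSD failure destroys at least one PG. The PG map is synthetic;
# the sizes are illustrative, not from a real cluster.
N_OSDS = 1000    # OSDs in the cluster (assumption)
N_PGS = 4096     # PGs in the pool (assumption)
M = 2            # replica count, i.e. failures needed to lose a PG
TRIALS = 500_000

# Each PG maps to an unordered set of M distinct OSDs.
fatal = {frozenset(random.sample(range(N_OSDS), M)) for _ in range(N_PGS)}

# Draw random M-way failures and count how many hit some PG exactly.
losses = sum(frozenset(random.sample(range(N_OSDS), M)) in fatal
             for _ in range(TRIALS))

print(f"simulated: P(loss | {M} failures) ~ {losses / TRIALS:.5f}")
print(f"analytic:  nPGs / C(nOSDs, M)     = {len(fatal) / comb(N_OSDS, M):.5f}")
```

The two printed numbers should agree to within the sampling noise, and
neither depends on the number of OSDs except through C(nOSDs, M).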
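For completeness, the per-OSD throttles that do exist today
(osd_max_backfills, osd_recovery_max_active); a quick way to tighten them
everywhere, wrapped in Python only to keep these sketches in one language.
There is still no cluster-wide cap, which is my complaint above:

```python
import subprocess

# Sketch: tighten the existing per-OSD recovery throttles on every OSD.
# This limits concurrent backfills per OSD, not across the whole cluster.
subprocess.run(
    ["ceph", "tell", "osd.*", "injectargs",
     "--osd-max-backfills 1 --osd-recovery-max-active 1"],
    check=True)
```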
Cheers, Dan
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com