Hi,

Christian Balzer wrote:

> For starters, make that 5 MONs.
> It won't really help you with your problem of keeping a quorum when
> losing a DC, but being able to lose more than 1 monitor will come in
> handy.
> Note that MONs don't really need to be dedicated nodes, if you know what
> you're doing and have enough resources (most importantly fast I/O, aka
> SSD for the leveldb) on another machine.

Ok, I'll keep that in mind.

>> In DC1: 2 "OSD" nodes, each with 6 OSD daemons, one per disk.
>> Journals on SSDs; there are 2 SSDs, so 3 journals per SSD.
>> In DC2: the same config.
>>
> Out of curiosity, is that a 1U case with 8 2.5" bays, or why that
> (relatively low) density per node?

Sorry, I have no idea because it was just an example, to be concrete. So
I took an (imaginary) server with 8 disks and 2 SSDs (among the 8 disks,
2 for the OS in software RAID1). Currently, I can't be precise about the
hardware because we are absolutely not settled on the budget (if we get
it!); there are lots of uncertainties.

> 4 nodes make a pretty small cluster; if you lose an SSD or a whole node
> your cluster will get rather busy and may run out of space if you filled
> it more than 50%.

Yes indeed, that's a relevant remark. If the cluster is ~50% full and a
node crashes in one DC, the other node in the same DC will be 100% full
and the cluster will be blocked. Indeed, the cluster is probably too
small.

> Unless your OSDs are RAID1s, a replica of 2 is basically asking Murphy
> to "bless" you with a double disk failure. A very distinct probability
> with 24 HDDs.

The probability of a *simultaneous* disk failure in DC1 and in DC2 seems
relatively low to me. For instance, if a disk fails in DC1 and the
rebalancing of the data takes ~1 or 2 hours, that seems acceptable to me.
But maybe I'm too optimistic... ;)

> With OSDs backed by plain HDDs you really want a replica size of 3.

But the "2-DCs" topology isn't really suitable for a replica size of 3,
is it? Is a replica size of 2 so risky?

> Normally you'd configure Ceph to NOT set OSDs out automatically if a DC
> fails (mon_osd_down_out_subtree_limit)

I didn't know about this option. In the online doc, the explanations are
not clear enough for me and I'm not sure I understand its meaning. If I
set:

    mon_osd_down_out_subtree_limit = datacenter

what are the consequences?

- If all the OSDs in DC2 are unreachable, these OSDs will not be marked
  out;
- but if only some OSDs in DC2 are unreachable, not all of them, these
  OSDs will be marked out.

Am I correct?
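(To make sure I understand, I suppose in ceph.conf it would be written
like the sketch below. The [mon] section is my assumption from the "mon_"
prefix of the option; I have not tested it.)

    # Sketch of my understanding, not tested: don't automatically mark
    # OSDs "out" when the failed unit is a whole datacenter (or larger)
    # in the CRUSH hierarchy.
    [mon]
    mon osd down out subtree limit = datacenter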
> but in the case of a prolonged DC
> outage you'll want to restore redundancy and set those OSDs out.
> Which means you will need 3 times the actual data capacity on your
> surviving 2 nodes.
> In other words, if your 24 OSDs are 2TB each you can "safely" only store
> 8TB in your cluster (48 TB / 3 (replicas) / 2 (DCs)).

I see, but my idea was to handle a disaster in DC1 long enough that I
must restart the cluster in degraded mode in DC2, but not long enough
that I must restore full redundancy in DC2. Personally I hadn't
considered this case and, unfortunately, I think we will never have the
budget to restore full redundancy in just one datacenter. I'm afraid
that's financially out of reach for us.

> Fiber isn't magical FTL (faster than light) communications and the
> latency depends (mostly) on the length (which you may or may not
> control) and the protocol used.
> A 2m long GbE link has a much worse latency than the same length in
> Infiniband.

In our case, if we can implement this infrastructure (if we get the
budget, etc.), the connection would probably be 2 dark fibers with 10 km
between DC1 and DC2, and we'll use Ethernet switches with SFP
transceivers (if you have good references for switches, I'm interested).
I suppose it should be possible to have low latencies in this case, no?

> You will of course need "enough" bandwidth, but what is going to kill
> (making it rather slow) your cluster will be the latency between those
> DCs.
>
> Each write will have to be acknowledged and this is where every ms less
> of latency will make a huge difference.

Yes indeed, I understand.

>> For instance, I suppose the OSD disks in DC1 (and in DC2) have
>> a throughput equal to 150 MB/s, so with 12 OSD disks in each DC,
>> I have:
>>
>> 12 x 150 = 1800 MB/s, i.e. 1.8 GB/s, i.e. 14.4 Mbps
>>
>> So, in the fiber, I need to have 14.4 Mbps. Is it correct?
>
> How do you get from 1.8 GigaByte/s to 14.4 Megabit/s?

Sorry, it was a misprint, I wanted to write 14.4 Gb/s of course. ;)

> You need to multiply, not divide.
> And assuming 10 bits (not 8) for a Byte when serialized never hurts.
> So that's 18 Gb/s.

Yes, indeed. So the "naive" estimation gives 18 Gb/s (OK for 10 bits
instead of 8).

>> Maybe my reasoning is too naive?
>
> Very much so. Your disks (even with SSD journals) will not write
> 150MB/s, because Ceph doesn't do long sequential writes (though 4MB
> blobs are better than nothing) and, more importantly, these writes
> happen concurrently.
> So while one client is writing to an object at one end of your HDD,
> another one may write to a very different, distant location. Seek
> delays.
> With more than one client, you'd be lucky to see 50-70MB/s per HDD.

Ok, but if I follow your explanations, the throughput obtained with the
"naive" estimation is too big. In fact, I could just have:

12 x 70 = 840 MB/s, i.e. 0.840 GB/s => 8.4 Gb/s

Correct?

>> Furthermore, I have not taken the SSDs into account. How can I
>> evaluate the needed throughput more precisely?
>>
> You need to consider the speed of the devices, their local bus
> (hopefully fast enough) and the network.
>
> All things considered you probably want a redundant link (but with
> bandwidth aggregation if both links are up).
> 10Gb/s per link would do, but 40Gb/s links (or your storage network on
> something other than Ethernet) will have less latency on top of the
> capacity for future expansion.

Ok, thanks for your help Christian.

-- 
François Lafont
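PS: to double-check the arithmetic of this thread, here is a throwaway
Python sketch. The ~5 µs/km figure for light in fiber and the 70 MB/s
per-HDD estimate are assumptions from the discussion, not measurements:

    # Back-of-the-envelope sizing for the inter-DC link discussed above.
    # All inputs are assumptions from this thread, not measurements.
    fiber_km = 10            # dark fiber length between DC1 and DC2
    us_per_km = 5.0          # ~5 microseconds per km for light in fiber
    osds_per_dc = 12
    mb_per_s_per_hdd = 70    # pessimistic per-HDD write throughput
    bits_per_byte = 10       # 10 bits per serialized byte, as suggested

    one_way_us = fiber_km * us_per_km                    # ~50 us
    rtt_us = 2 * one_way_us                              # ~100 us = 0.1 ms
    aggregate_mb_s = osds_per_dc * mb_per_s_per_hdd      # 840 MB/s
    needed_gb_s = aggregate_mb_s * bits_per_byte / 1000  # 8.4 Gb/s

    print("one-way: %.0f us, RTT: %.0f us" % (one_way_us, rtt_us))
    print("aggregate writes: %d MB/s -> link: %.1f Gb/s"
          % (aggregate_mb_s, needed_gb_s))

So the 10 km of fiber itself only costs ~0.1 ms of RTT; the switches and
the protocol stack will dominate, which matches what Christian said about
latency.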