Hello,

On Sun, 19 Apr 2015 06:22:44 +0200 Francois Lafont wrote:

> Hi,
>
> We are thinking about a ceph infrastructure and I have questions.
> Here is the conceived (but not yet implemented) infrastructure:
> (please, be careful to read the schema with a monospace font ;))
>
>
>                  +---------+
>                  |  users  |
>                  |(browser)|
>                  +----+----+
>                       |
>                       |
>                  +----+----+
>                  |         |
>       +----------+   WAN   +------------+
>       |          |         |            |
>       |          +---------+            |
>       |                                 |
>       |                                 |
> +-----+-----+                     +-----+-----+
> |           |                     |           |
> | monitor-1 |                     | monitor-3 |
> | monitor-2 |                     |           |
> |           |  Fiber connection   |           |
> |           +---------------------+           |
> |   OSD-1   |                     |  OSD-13   |
> |   OSD-2   |                     |  OSD-14   |
> |    ...    |                     |    ...    |
> |  OSD-12   |                     |  OSD-24   |
> |           |                     |           |
> | client-a1 |                     | client-a2 |
> | client-b1 |                     | client-b2 |
> |           |                     |           |
> +-----------+                     +-----------+
>  Datacenter1                       Datacenter2
>     (DC1)                             (DC2)
>
For starters, make that 5 MONs. It won't really help you with your
problem of keeping a quorum when losing a DC, but being able to lose
more than 1 monitor will come in handy.
Note that MONs don't really need to be dedicated nodes, if you know
what you're doing and have enough resources (most importantly fast I/O
aka SSD for the leveldb) on another machine.

> In DC1: 2 "OSD" nodes each with 6 OSDs daemons, one per disk.
> Journals in SSD, there are 2 SSD so 3 journals per SSD.
> In DC2: the same config.
>
Out of curiosity, is that a 1U case with 8 2.5" bays, or why that
(relatively low) density per node?

4 nodes make a pretty small cluster; if you lose a SSD or a whole node
your cluster will get rather busy and may run out of space if you
filled it more than 50%.

> You can imagine for instance that:
> - client-a1 and client-a2 are radosgw
> - client-b1 and client-b2 are web servers which use the Cephfs of the
> cluster.
>
> And of course, the principle is to have data dispatched in DC1 and
> DC2 (size == 2, one copy of the object in DC1, the other in DC2).
>
Unless your OSDs are RAID1s, a replica of 2 is basically asking Murphy
to "bless" you with a double disk failure. A very distinct probability
with 24 HDDs.
With OSDs backed by plain HDDs you really want a replica size of 3.

Normally you'd configure Ceph to NOT set OSDs out automatically if a
DC fails (mon_osd_down_out_subtree_limit), but in the case of a
prolonged DC outage you'll want to restore redundancy and set those
OSDs out.
Which means you will need 3 times the actual data capacity on your
surviving 2 nodes.
In other words, if your 24 OSDs are 2TB each you can "safely" only
store 8TB in your cluster (48TB / 3 (replica) / 2 (DCs)).
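Spelled out as a trivial Python sketch, just so the numbers are
concrete (the 2TB drive size is only an assumed example):

    # 24 OSDs of 2 TB each -- adjust to your real drive size
    raw_tb = 24 * 2                    # 48 TB of raw space
    replicas = 3                       # what you want with plain HDDs
    data_tb = raw_tb / replicas        # 16 TB of actual data, all OSDs up
    # after losing a DC, the surviving half must still hold 3 copies
    safe_tb = (raw_tb / 2) / replicas  # 8 TB you can "safely" store
    print(data_tb, safe_tb)            # 16.0 8.0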
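And for reference, a minimal ceph.conf sketch of the settings I'm
talking about; treat the values as assumptions to check against your
own setup, in particular your CRUSH map needs to actually define
"datacenter" buckets for DC1 and DC2 for the subtree limit to mean
anything:

    [global]
        # 3 copies of each object, keep serving I/O with 2 still up
        osd pool default size = 3
        osd pool default min size = 2

    [mon]
        # don't automatically mark OSDs "out" when an entire CRUSH
        # subtree of type "datacenter" (or larger) goes down,
        # i.e. during a DC-wide outage
        mon osd down out subtree limit = datacenter

If the outage turns out to be a long one, you'd then mark the dead
DC's OSDs out by hand to get your redundancy back.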
> 1. If I suppose that the latency between DC1 and DC2 (via the fiber
> connection) is ok, I would like to know which throughput do I need to
> avoid network bottleneck? Is there a rule to compute the needed
> throughput? I suppose it depends on the disk throughputs?
>
Fiber isn't magical FTL (faster than light) communications and the
latency depends (mostly) on the length (which you may or may not
control) and the protocol used. A 2m long GbE link has a much worse
latency than the same length in Infiniband.

You will of course need "enough" bandwidth, but what is going to kill
(making it rather slow) your cluster will be the latency between those
DCs. Each write will have to be acknowledged and this is where every
ms less of latency will make a huge difference.

> For instance, I suppose the OSD disks in DC1 (and in DC2) has
> a throughput equal to 150 MB/s, so with 12 OSD disk in each DC,
> I have:
>
> 12 x 150 = 1800 MB/s ie 1.8 GB/s, ie 14.4 Mbps
>
> So, in the fiber, I need to have 14.4 Mbs. Is it correct?
>
How do you get from 1.8 GigaByte/s to 14.4 Megabit/s?
You need to multiply, not divide. And assuming 10 bits (not 8) for a
Byte when serialized never hurts.
So that's 18 Gb/s.

> Maybe is it too naive reasoning?
>
Very much so.
Your disks (even with SSD journals) will not write 150MB/s, because
Ceph doesn't do long sequential writes (though 4MB blobs are better
than nothing) and, more importantly, these writes happen concurrently.
So while one client is writing to an object at one end of your HDD,
another one may write to a very different, distant location. Seeking
delays.
With more than one client, you'd be lucky to see 50-70MB/s per HDD.

> Furthermore I have not taken into account the SSD. How evaluate the
> needed throughput more precisely?
>
You need to consider the speed of the devices, their local bus
(hopefully fast enough) and the network.
All things considered you probably want a redundant link (but with
bandwidth aggregation if both links are up). 10Gb/s per link would do,
but 40Gb/s links (or your storage network on something other than
Ethernet) will have less latency on top of the capacity for future
expansion.

I'll leave the failover bits to others.

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com