Questions about an example Ceph infrastructure

Hi,

We are planning a Ceph infrastructure and I have some questions.
Here is the proposed (but not yet implemented) infrastructure
(please read the schema with a monospace font ;)):


                 +---------+
                 |  users  |
                 |(browser)|
                 +----+----+
                      |
                      |
                 +----+----+
                 |         |
      +----------+   WAN   +------------+
      |          |         |            |
      |          +---------+            |
      |                                 |
      |                                 |
+-----+-----+                     +-----+-----+
|           |                     |           |
| monitor-1 |                     | monitor-3 |
| monitor-2 |                     |           |
|           |  Fiber connection   |           |
|           +---------------------+           |
|  OSD-1    |                     |  OSD-13   |
|  OSD-2    |                     |  OSD-14   |
|   ...     |                     |   ...     |
|  OSD-12   |                     |  OSD-24   |
|           |                     |           |
| client-a1 |                     | client-a2 |
| client-b1 |                     | client-b2 |
|           |                     |           |
+-----------+                     +-----------+
 Datacenter1                       Datacenter2
    (DC1)                             (DC2)

In DC1: 2 "OSD" nodes, each with 6 OSD daemons, one per disk.
        Journals are on SSDs: each node has 2 SSDs, so 3 journals per SSD.
In DC2: the same configuration.

You can imagine, for instance, that:
- client-a1 and client-a2 are radosgw instances;
- client-b1 and client-b2 are web servers which use the CephFS of the cluster.

And of course, the principle is to have the data dispatched across DC1 and
DC2 (size == 2: one copy of each object in DC1, the other in DC2).
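
(For the placement, I imagine a CRUSH rule of this kind in the decompiled
CRUSH map, assuming the map declares the two sites as buckets of type
"datacenter" under the default root; just a sketch, the names are mine:

    rule one_copy_per_dc {
            ruleset 1
            type replicated
            min_size 2
            max_size 2
            step take default
            step chooseleaf firstn 0 type datacenter
            step emit
    }

With pool size == 2, "chooseleaf firstn 0 type datacenter" picks two
distinct datacenters and one OSD under each.)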


1. If I suppose that the latency between DC1 and DC2 (via the fiber
connection) is OK, I would like to know what throughput I need to avoid a
network bottleneck. Is there a rule to compute the needed throughput? I
suppose it depends on the disk throughput?

For instance, I suppose each OSD disk in DC1 (and in DC2) has a throughput
of 150 MB/s, so with 12 OSD disks in each DC, I have:

    12 x 150 MB/s = 1800 MB/s, i.e. 1.8 GB/s, i.e. 14.4 Gbps

So, on the fiber, I need 14.4 Gbps. Is that correct? Or maybe this
reasoning is too naive?

Furthermore, I have not taken the SSDs into account. How can I evaluate the
needed throughput more precisely?
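
For instance, I would sketch the computation like this (a rough Python
sketch; the 400 MB/s per journal SSD is purely an assumed figure, and the
4 SSDs per DC come from 2 SSDs per node x 2 nodes):

    # Rough estimate of the inter-DC throughput needed so that the
    # fiber is not the bottleneck. The per-device figures are
    # assumptions, not measurements.

    def inter_dc_gbps(disks_per_dc=12, disk_mb_s=150.0,
                      ssds_per_dc=4, ssd_mb_s=400.0):
        # Raw aggregate write bandwidth of the data disks in one DC.
        disk_bw = disks_per_dc * disk_mb_s          # MB/s
        # With journals on SSD, every write hits a journal first,
        # so the SSDs can cap the ingest below the raw disk figure.
        ssd_bw = ssds_per_dc * ssd_mb_s             # MB/s
        sustained = min(disk_bw, ssd_bw)            # MB/s
        # With size == 2 and one copy per DC, one replica of every
        # write crosses the fiber, so the link must carry about this.
        return sustained * 8 / 1000.0               # MB/s -> Gbit/s

    print(inter_dc_gbps())    # 12.8 Gbps with these assumptions
                              # (without the SSD cap: 1800 MB/s -> 14.4 Gbps)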


2. I'm thinking about disaster recovery too. For instance, if there is a
disaster in DC2, DC1 will keep working (fine). But if there is a disaster
in DC1, DC2 will not work (no quorum: only 1 monitor out of 3 survives).

But now, suppose there is a long and severe disaster in DC1, so that DC1
is totally unreachable. In this case, I want to (manually) restart my Ceph
cluster in DC2 alone. No problem with that, I have seen explanations in the
documentation on how to do it (concrete commands below):

- I stop monitor-3
- I extract the monmap
- I remove monitor-1 and monitor-2 from this monmap
- I inject the new monmap into monitor-3
- I restart monitor-3
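
Concretely, if I have understood the documentation correctly, something
like this on the monitor-3 node (the exact service commands depend on the
init system):

    # stop the surviving monitor first
    stop ceph-mon id=monitor-3

    # extract its current monmap
    ceph-mon -i monitor-3 --extract-monmap /tmp/monmap

    # remove the unreachable DC1 monitors from the map
    monmaptool /tmp/monmap --rm monitor-1
    monmaptool /tmp/monmap --rm monitor-2

    # inject the modified monmap and restart the monitor
    ceph-mon -i monitor-3 --inject-monmap /tmp/monmap
    start ceph-mon id=monitor-3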

After that, DC1 is unreachable but DC2 is working (with only one monitor).

But what happens if DC1 becomes reachable again? What will the behavior of
monitor-1 and monitor-2 be in this case? Will monitor-1 and monitor-2
understand that they no longer belong to the Ceph cluster?
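
(I suppose I could check their view through the admin socket, for instance:

    ceph daemon mon.monitor-1 mon_status

and look at the "monmap" and "quorum" fields, but I'm not sure what they
would report in this situation.)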

And now I imagine the worst scenario: DC1 becomes reachable again, but the
switch in DC1 which is connected to the fiber takes a long time to restart,
so that, during a short period, DC1 is reachable but the connection with DC2
is not yet operational. What happens during this period? client-a1 and
client-b1 could write data to the cluster in this case, right? And the data
in the cluster could be compromised, because DC1 is not aware of the writes
in DC2. Am I wrong?

My conclusion about that is: in case of a long disaster in DC1, I can
restart the Ceph cluster in DC2 with the method described above (removing
monitor-1 and monitor-2 from the monmap in monitor-3, etc.) but *only* *if*
I can first definitively stop monitor-1 and monitor-2 in DC1 (and if I
can't, I do nothing and I wait). Is that correct?

Thanks in advance for your explanations.

-- 
François Lafont




