Hi,

We are thinking about a Ceph infrastructure and I have some questions. Here is the conceived (but not yet implemented) infrastructure (please be careful to read the schema with a monospace font ;)):

              +---------+
              |  users  |
              |(browser)|
              +----+----+
                   |
                   |
              +----+----+
              |         |
      +-------+   WAN   +-------+
      |       |         |       |
      |       +---------+       |
      |                         |
+-----+-----+             +-----+-----+
|           |             |           |
| monitor-1 |             | monitor-3 |
| monitor-2 |             |           |
|           |    Fiber    |           |
|           | connection  |           |
|           +-------------+           |
|           |             |           |
| OSD-1     |             | OSD-13    |
| OSD-2     |             | OSD-14    |
| ...       |             | ...       |
| OSD-12    |             | OSD-24    |
|           |             |           |
| client-a1 |             | client-a2 |
| client-b1 |             | client-b2 |
|           |             |           |
+-----------+             +-----------+
 Datacenter1               Datacenter2
    (DC1)                     (DC2)

In DC1: 2 "OSD" nodes, each with 6 OSD daemons, one per disk. Journals are on SSDs: there are 2 SSDs per node, so 3 journals per SSD. In DC2: the same configuration.

You can imagine, for instance, that:
- client-a1 and client-a2 are radosgw;
- client-b1 and client-b2 are web servers which use the CephFS of the cluster.

And of course, the principle is to have data dispatched across DC1 and DC2 (size == 2, one copy of the object in DC1, the other in DC2). (See PS1 at the end of this mail for the kind of CRUSH rule I have in mind.)

1. If I suppose that the latency between DC1 and DC2 (via the fiber connection) is OK, I would like to know what throughput I need to avoid a network bottleneck. Is there a rule to compute the needed throughput? I suppose it depends on the disk throughputs? For instance, if I suppose the OSD disks in DC1 (and in DC2) have a throughput equal to 150 MB/s, then with 12 OSD disks in each DC I have:

    12 x 150 = 1800 MB/s, i.e. 1.8 GB/s, i.e. 14.4 Gbps

So, on the fiber, I need to have 14.4 Gbps. Is that correct? Maybe it is too naive a reasoning? Furthermore, I have not taken the SSDs into account. How can I evaluate the needed throughput more precisely? (I put this arithmetic in a small script in PS2 at the end of this mail.)

2. I'm thinking about disaster recovery too. For instance, if there is a disaster in DC2, DC1 will keep working (fine). But if there is a disaster in DC1, DC2 will not work (no quorum). Now, suppose there is a long and big disaster in DC1, so that DC1 is totally unreachable. In this case, I want to start my Ceph cluster in DC2 manually. No problem with that, I have seen the explanations in the documentation to do it:

- I stop monitor-3;
- I extract the monmap;
- I remove monitor-1 and monitor-2 from this monmap;
- I inject the new monmap into monitor-3;
- I restart monitor-3.

(The concrete commands, as I understand them from the documentation, are in PS3 at the end of this mail.)

After that, DC1 is unreachable but DC2 is working (with only one monitor). But what happens if DC1 becomes reachable again? What will the behavior of monitor-1 and monitor-2 be in this case? Do monitor-1 and monitor-2 understand that they no longer belong to the Ceph cluster?

And now I imagine the worst scenario: DC1 becomes reachable again, but the switch in DC1 which is connected to the fiber takes a long time to restart, so that, during a short period, DC1 is reachable but the connection with DC2 is not yet operational. What happens during this period? client-a1 and client-b1 could write data to the cluster in this case, right? And the data in the cluster could be compromised, because DC1 is not aware of the writes in DC2. Am I wrong?

My conclusion about that is: in case of a long disaster in DC1, I can restart the Ceph cluster in DC2 with the method described above (removing monitor-1 and monitor-2 from the monmap in monitor-3, etc.) but *only if* I can definitively stop monitor-1 and monitor-2 in DC1 first (and if I can't, I do nothing and I wait). Is that correct?

Thanks in advance for your explanations.
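PS1: For the "one copy per datacenter" placement, here is the kind of setup I have in mind. It is only a sketch under my assumptions: it supposes the CRUSH map already declares "datacenter" buckets for DC1 and DC2 under the root "default", and the pool name "mypool" and the ruleset id "1" are hypothetical example values.

    # Create a replicated rule that picks one leaf per datacenter
    # (assumes "datacenter" buckets exist in the CRUSH map):
    ceph osd crush rule create-simple one-per-dc default datacenter

    # Point the pool at this rule with 2 replicas; "mypool" and the
    # ruleset id "1" are hypothetical, check the real id with
    # "ceph osd crush rule dump":
    ceph osd pool set mypool size 2
    ceph osd pool set mypool crush_ruleset 1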
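PS2: The arithmetic of question 1 as a small shell sketch, so you can check my reasoning. The 12 disks per DC and the 150 MB/s per disk are assumed figures, not measurements.

    # Back-of-the-envelope inter-DC bandwidth estimate.
    # Assumption: with size=2 and one replica per DC, every byte
    # written in one DC crosses the fiber once to reach the other DC,
    # so the link must absorb the aggregate disk write throughput.
    disks_per_dc=12
    disk_mb_per_s=150

    aggregate_mb_per_s=$((disks_per_dc * disk_mb_per_s))      # 1800 MB/s
    link_gbit_per_s=$(echo "scale=1; $aggregate_mb_per_s * 8 / 1000" | bc)

    echo "aggregate disk write rate: ${aggregate_mb_per_s} MB/s"
    echo "needed fiber throughput:   ${link_gbit_per_s} Gbit/s"  # 14.4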
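PS3: The monitor surgery of question 2, with the concrete commands as I understand them from the documentation. This is only a sketch: the monitor ids follow my schema above, the /tmp/monmap path is just my example, and the exact service invocation depends on the init system. Please correct me if a step is wrong.

    # On monitor-3 (the surviving monitor in DC2):

    # 1. Stop the monitor daemon (spelling depends on the init system):
    service ceph stop mon.monitor-3

    # 2. Extract the current monmap from its store:
    ceph-mon -i monitor-3 --extract-monmap /tmp/monmap

    # 3. Remove the two unreachable DC1 monitors from the map:
    monmaptool /tmp/monmap --rm monitor-1 --rm monitor-2

    # 4. Inject the modified monmap back into monitor-3:
    ceph-mon -i monitor-3 --inject-monmap /tmp/monmap

    # 5. Restart the monitor; it should now form a quorum of one:
    service ceph start mon.monitor-3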
--
François Lafont