On Tue, Jan 16, 2018 at 2:17 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Tue, Jan 16, 2018 at 6:07 AM Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>
> wrote:
>>
>> I found a few WAN RBD cluster design discussions, but not a local one,
>> so was wondering if anyone has experience with a resilience-oriented
>> short-distance (<10 km, redundant fiber connections) cluster in two
>> datacenters, with a third site for quorum purposes only?
>>
>> I can see two types of scenarios:
>>
>> 1. Two (or an even number of) OSD nodes at each site, 4x replication
>> (size 4, min_size 2). Three MONs, one at each site, to handle split
>> brain.
>>
>> Question: How does the cluster handle the loss of communication
>> between the OSD sites A and B while both can communicate with the
>> quorum site C? It seems one of the sites should suspend, as OSDs
>> will not be able to communicate between sites.
>
> Sadly this won't work — the OSDs on each side will report their peers on
> the other side down, but both will be able to connect to a live monitor.
> (Assuming the quorum site holds the leader monitor, anyway — if one of
> the main sites holds what should be the leader, you'll get into a
> monitor election storm instead.) You'll need your own netsplit
> monitoring to shut down one site if that kind of connection cut is a
> possibility.

What about running a split-brain-aware tool, such as Pacemaker, and
running a copy of the same VM as a MON at each site? In case of a
split-brain network separation, Pacemaker would (aware via the third
site) stop the MON on site A and bring up the MON on site B (or whatever
the rules are set to). I read earlier that a MON with the same IP, name,
and keyring would just look to the Ceph cluster like a very old MON, but
would still be able to vote for quorum.

Vincent Godin also described an HSRP-based method, which would
accomplish this goal via network routing. That seems like a good
approach too; I just need to check on HSRP availability.

>
>>
>> 2.
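For scenario 1, a rough sketch of the pool and CRUSH settings that would
place two replicas at each site (assuming a CRUSH hierarchy that defines
a "datacenter" bucket type containing the two OSD sites; the pool name
"rbd" and rule name "stretch_rule" are just examples, and the pool-set
syntax assumes Luminous or later — this is untested against a real
cluster):

```shell
# Example CRUSH rule: pick 2 datacenters, then 2 hosts within each,
# yielding 4 replicas split 2+2 across sites A and B.
# (Compile into the cluster's CRUSH map with crushtool / setcrushmap.)
cat > stretch_rule.txt <<'EOF'
rule stretch_rule {
    ruleset 1
    type replicated
    min_size 2
    max_size 4
    step take default
    step choose firstn 2 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}
EOF

# With size 4 / min_size 2, the pool keeps serving I/O when one
# entire site (2 of the 4 replicas) is lost.
ceph osd pool set rbd size 4
ceph osd pool set rbd min_size 2
ceph osd pool set rbd crush_rule stretch_rule
```

Note that this only controls data placement; as Greg points out, it does
nothing about the OSD flapping reports during a netsplit, which is why
external fencing (Pacemaker or similar) is still needed.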
>> 3x replication for performance or cost (size 3, min_size 2, or
>> even min_size 1 and strict monitoring). Two replicas and two MONs at
>> one site, and one replica and one MON at the other site.
>>
>> Question: in case of a permanent failure of the main site (with two
>> replicas), how to manually force the other site (with one replica and
>> one MON) to provide storage? I would think a CRUSH map change and
>> modifying ceph.conf to include just one MON, then build two more MONs
>> locally and add?
>
> Yep, pretty much that. You won't need to change ceph.conf to just one
> mon so much as to include the current set of mons and update the
> monmap. I believe that process is in the disaster recovery section of
> the docs.

Thank you.

Alex

> -Greg
>
>> --
>> Alex Gorbachev
>> Storcium

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
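For reference, the monmap surgery Greg mentions (from the Ceph monitor
disaster-recovery docs) looks roughly like the following; the mon IDs
"a" and "b" (lost site) and "c" (surviving site) are placeholders, and
this should be rehearsed on a test cluster first:

```shell
# Stop the surviving monitor and pull its current monmap.
systemctl stop ceph-mon@c
ceph-mon -i c --extract-monmap /tmp/monmap

# Remove the unreachable monitors so mon.c can form quorum by itself.
monmaptool /tmp/monmap --rm a --rm b

# Inject the edited map back and restart the surviving monitor.
ceph-mon -i c --inject-monmap /tmp/monmap
systemctl start ceph-mon@c
```

Once mon.c has quorum on its own, two fresh MONs can be deployed at the
surviving site and will join via the normal add-monitor procedure.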