On Thu, Jan 18, 2018 at 5:57 AM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
> On Tue, Jan 16, 2018 at 2:17 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>> On Tue, Jan 16, 2018 at 6:07 AM Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>>>
>>> I found a few WAN RBD cluster design discussions, but not a local one,
>>> so I was wondering if anyone has experience with a resilience-oriented,
>>> short-distance (<10 km, redundant fiber connections) cluster spanning two
>>> datacenters, with a third site for quorum purposes only?
>>>
>>> I can see two types of scenarios:
>>>
>>> 1. Two (or another even number of) OSD nodes at each site, 4x replication
>>> (size 4, min_size 2). Three MONs, one at each site to handle split
>>> brain.
>>>
>>> Question: How does the cluster handle the loss of communication
>>> between the OSD sites A and B while both can still communicate with the
>>> quorum site C? It seems one of the sites should suspend, as OSDs
>>> will not be able to communicate between sites.
>>
>>
>> Sadly this won't work — the OSDs on each side will report their peers on the
>> other side down, but both will be able to connect to a live monitor.
>> (Assuming the quorum site holds the leader monitor, anyway — if one of the
>> main sites holds what should be the leader, you'll get into a monitor
>> election storm instead.) You'll need your own netsplit monitoring to shut
>> down one site if that kind of connection cut is a possibility.
>
> What about running a split-brain-aware tool, such as Pacemaker, and
> running a copy of the same VM as a mon at each site? In case of a
> split-brain network separation, Pacemaker would (aware via the third site)
> stop the mon on site A and bring up the mon on site B (or whatever the
> rules are set to). I read earlier that a mon with the same IP, name
> and keyring would just look to the Ceph cluster like a very old mon, but
> one still able to vote for quorum.

It probably is, but don't do that: just use your network monitoring to shut
down the site you've decided is less important. There is no need to try to
replace its monitor on the primary site or anything like that. (It would leave
you with a mess when trying to restore the secondary site!) If you're worried
about handling an additional monitor failure, you can run two per site (plus
the quorum tiebreaker).
-Greg

>
> Vincent Godin also described an HSRP-based method, which would
> accomplish this goal via network routing. That seems like a good
> approach too; I just need to check on HSRP availability.
>
>>
>>>
>>>
>>> 2. 3x replication for performance or cost (size 3, min_size 2 - or
>>> even min_size 1 and strict monitoring). Two replicas and two MONs at
>>> one site, and one replica and one MON at the other site.
>>>
>>> Question: in case of a permanent failure of the main site (with two
>>> replicas), how do we manually force the other site (with one replica and
>>> one MON) to provide storage? I would think a CRUSH map change and
>>> modifying ceph.conf to include just the one MON, then building two more
>>> MONs locally and adding them?
>>
>>
>> Yep, pretty much that. You won't need to change ceph.conf to just one mon so
>> much as to include the current set of mons and update the monmap. I believe
>> that process is in the disaster recovery section of the docs.
>
> Thank you.
>
> Alex
>
>> -Greg
>>
>>>
>>> --
>>> Alex Gorbachev
>>> Storcium
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
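
For scenario 1, the usual way to get the size 4 / min_size 2 layout with two
copies per site is a CRUSH rule that picks two datacenter buckets and then two
hosts in each. A rough, untested sketch follows; it assumes the OSD hosts have
already been moved under datacenter buckets (here called siteA and siteB) and
that the pool is named rbd - adjust all names to your own map:

    # decompile the current CRUSH map
    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt

    # add a rule along these lines to crush.txt:
    #
    # rule rbd_two_sites {
    #         id 1
    #         type replicated
    #         min_size 1
    #         max_size 10
    #         step take default
    #         step choose firstn 2 type datacenter
    #         step chooseleaf firstn 2 type host
    #         step emit
    # }

    # recompile, inject, and point the pool at the new rule
    crushtool -c crush.txt -o crush.new
    ceph osd setcrushmap -i crush.new
    ceph osd pool set rbd crush_rule rbd_two_sites
    ceph osd pool set rbd size 4
    ceph osd pool set rbd min_size 2

For scenario 2, the monmap update Greg refers to is the "removing monitors
from an unhealthy cluster" procedure in the docs. A sketch, assuming the
surviving mon at the small site is mon-c and the dead ones are mon-a and
mon-b (stop the surviving ceph-mon daemon before doing this, then restart it
afterwards):

    ceph-mon -i mon-c --extract-monmap /tmp/monmap
    monmaptool /tmp/monmap --rm mon-a
    monmaptool /tmp/monmap --rm mon-b
    ceph-mon -i mon-c --inject-monmap /tmp/monmap
    # after restarting, mon-c forms a quorum of one; new mons can then be
    # created and added at the surviving site in the normal way
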