Re: Split-brain in a multi-site cluster

On 02/02/2017 04:01 PM, Ilia Sokolinski wrote:
Hi,

We are testing a multi-site Ceph cluster using the 0.94.5 release.
There are 2 sites with 2 Ceph nodes in each site.
Each node is running a monitor and a bunch of OSDs.
The CRUSH rules are configured to require a copy of the data in each site (a sketch of such a rule is shown below).
The sites are connected by a private high-speed link.
In addition, there is a 5th monitor placed in AWS, and both sites have AWS connectivity.
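
For illustration, a rule of the shape described above might look roughly like this in the decompiled CRUSH map (a minimal sketch, assuming each site is modeled as a `datacenter` bucket under the `default` root and the pool size equals the number of sites; the rule name and ruleset number are placeholders, not taken from the original message):

    # Sketch: place each replica under a different datacenter bucket
    rule multisite_replicated {
            ruleset 1
            type replicated
            min_size 2
            max_size 2
            step take default
            step chooseleaf firstn 0 type datacenter
            step emit
    }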

We are doing split-brain testing where we use iptables to simulate a cut in the link between the two sites (a sketch of the approach is shown below).
However, the connectivity to AWS from both sites is not affected in this test.
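
For reference, that kind of cut could be simulated with something along these lines on the nodes of one site (a sketch only; 10.0.2.0/24 is a placeholder for the other site's subnet, not taken from the original message, and the AWS-facing traffic is left untouched):

    # Drop all traffic to/from the other site's subnet (placeholder address)
    iptables -A INPUT  -s 10.0.2.0/24 -j DROP
    iptables -A OUTPUT -d 10.0.2.0/24 -j DROP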

The expected behavior is that one site goes down while the other continues to function.

The observed behavior is as follows:

The monitors behave as expected:
   the 2 monitors in one site are declared dead, and the 2 monitors in the other site plus the AWS monitor form a new quorum.

The OSDs do not behave well:
From the logs, each OSD cannot heartbeat to any OSD in the other site. This is expected.

However, the OSDs on the “dead” site are not declared “down”.
Some of them go down and then come back up, but mostly they stay up.

As a result, all PGs are stuck in the “peering” state and the cluster is not usable: no clients can do any reads or writes at either site.

Is this expected?

Unfortunately, yes.

This is because your OSDs are still able to reach one of the monitors in the quorum: the AWS monitor.

What's happening here is that the OSDs on each site assume the OSDs on the other site are down, because their heartbeats cannot reach them.

Both sides can still reach a monitor in the quorum (the AWS monitor), so both sides report the other side as down to the monitors.

The monitor takes those reports and eventually marks some OSDs down in the osdmap, then pushes the new osdmap to the OSDs. The OSDs that have been marked down, seeing that they are not really down because they can still reach a monitor, tell the monitors they have been wrongly marked down and boot back up.

This is the sort of flapping you're currently seeing.

There is a patch in newer versions that prevents this from happening, but I believe it may not serve you well: it broadens the failure domain considered for these reports from what you have now (osd) to host (by default, or something else if you so choose). This helps with a similar scenario in which a whole host is cut off from the other hosts, or (similarly) a whole rack. I honestly have no idea whether it would behave properly if set to a whole datacenter.

For reference, the config option would be `mon_osd_reporter_subtree_level`.
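
For illustration only, on a release that has this option it could be set on the monitors roughly like this (a sketch; using `datacenter` as the value is an assumption about how your CRUSH hierarchy names the sites, and the default is `host`):

    [mon]
        # Count "osd X is down" reports per datacenter rather than per host
        mon osd reporter subtree level = datacenter

The value should match a bucket type that actually exists in your CRUSH hierarchy.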


  -Joao