Re: Split-brain in a multi-site cluster

On 02/02/2017 04:01 PM, Ilia Sokolinski wrote:
Hi,

We are testing a multi-site Ceph cluster using the 0.94.5 release.
There are two sites with two Ceph nodes in each site.
Each node is running a monitor and a bunch of OSDs.
The CRUSH rules are configured to require a copy of the data in each site (a sketch of such a rule is included below).
The sites are connected by a private high-speed link.
In addition, there is a 5th monitor placed in AWS, and both sites have AWS connectivity.
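
The rule below is only a sketch of what we mean; the rule name, ruleset number, and the "datacenter" bucket type are illustrative rather than copied from our actual CRUSH map:

    rule replicated_per_site {
        ruleset 1
        type replicated
        min_size 2
        max_size 2
        # choose one bucket of type "datacenter" per replica, then one
        # OSD (via a host) inside each chosen datacenter, so every PG
        # keeps a copy in each site
        step take default
        step choose firstn 0 type datacenter
        step chooseleaf firstn 1 type host
        step emit
    }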

We are doing split-brain testing, where we use iptables to simulate a cut in the link between the two sites.
However, the connectivity to AWS from both sites is not affected in this test.
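
For what it's worth, the cut is simulated with rules along these lines on every node of one site (the subnet below is illustrative, not our real addressing):

    # drop all traffic to/from the other site's storage subnet,
    # while the route to the AWS monitor stays untouched
    iptables -A INPUT  -s 10.2.0.0/16 -j DROP
    iptables -A OUTPUT -d 10.2.0.0/16 -j DROP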

The expected behavior would be that one site will go down, but the other one will continue to function.

The observed behavior is as follows:

The monitors behave as expected:
   The 2 monitors in one site are declared dead, and the 2 monitors in the other site plus the AWS monitor form a new quorum.

The OSDs do not behave well:
From the logs, each OSD can’t heartbeat to any OSD in the other site. This is expected.

However, the OSDs on the “dead” site are not declared “down”.
Some of them go down and then back up, but mostly they stay up.

As a result, all PGs are stuck in the “peering” state, and the cluster is not usable - no clients can do any reads or writes in either site.

Is this expected?

Unfortunately, yes.

This is because your OSDs are still able to reach one of the monitors in the quorum: the AWS monitor.

What's happening here is that the OSDs on each site assume the OSDs on the other site are down, because their heartbeats cannot reach them.

Because both sides can still reach a monitor in the quorum (the AWS monitor), both sides report the other side's OSDs as down to the monitors.

The monitors take those reports and, eventually, mark some OSDs down in the osdmap and push the new osdmap to the OSDs. The OSDs that have been marked down, seeing that they are not really down because they can still reach a monitor, tell the monitors they have been wrongly marked down and boot back up.

This is the sort of flapping you're currently seeing.
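
You can watch it happening with nothing more than the standard CLI (no special tooling assumed):

    # follow the cluster log and osdmap updates while the link is cut;
    # you should see OSDs being marked down and booting back up
    ceph -w

    # see which OSDs the current osdmap considers up or down
    ceph osd tree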

There is a patch in newer versions that prevents this from happening, but I believe it may not serve you well, as it broadens the failure domain considered for these reports from what you have now (osd) to host (by default, or something else if you so choose). This could help with a similar scenario in which a whole host is cut off from the other hosts, or (similarly) a rack. I honestly have no clue whether it would behave properly if set to a whole datacenter.

For reference, the config option would be `mon_osd_reporter_subtree_level`.
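
If you want to experiment with it anyway, it is set on the monitors; a minimal sketch, assuming your CRUSH map actually has a "datacenter" bucket type (and, as said above, I can't vouch for how it behaves at that level):

    [mon]
        # count down-reports per CRUSH subtree of this type instead of
        # per individual reporting OSD; "datacenter" is only an example
        # and must match a bucket type present in your CRUSH map
        mon osd reporter subtree level = datacenter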


  -Joao
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



