Split-brain in a multi-site cluster

Hi,

We are testing a multi-site Ceph cluster running the 0.94.5 (Hammer) release.
There are two sites with two Ceph nodes in each site.
Each node runs a monitor and a number of OSDs.
The CRUSH rules are configured to require a copy of the data in each site.
The sites are connected by a private high-speed link.
In addition, there is a fifth monitor placed in AWS, and both sites have connectivity to AWS.
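
For reference, the rule is along these lines (just a sketch, assuming a replicated pool of size 4 with two copies per site and a custom "site" bucket type; the names are illustrative):

    rule replicated_multisite {
            ruleset 1
            type replicated
            min_size 2
            max_size 4
            step take default
            # pick two site buckets, then two hosts (and one OSD each) per site
            step choose firstn 2 type site
            step chooseleaf firstn 2 type host
            step emit
    }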

We are doing split-brain testing in which we use iptables to simulate a cut of the link between the two sites.
Connectivity to AWS from both sites is not affected in this test.
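
The cut is simulated with rules roughly like the following on each node (the addresses are illustrative; the route to AWS is left untouched):

    # drop all traffic to and from the other site's Ceph network
    iptables -A INPUT  -s 192.168.2.0/24 -j DROP
    iptables -A OUTPUT -d 192.168.2.0/24 -j DROP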

The expected behavior is that one site goes down while the other continues to function.

The observed behavior is as follows:

The monitors behave as expected:
   The 2 monitors in one site are declared dead, and the other 2 monitors plus the AWS monitor form a new quorum (see the commands below).
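   We verify this from a node on the site that keeps quorum with something like:

       ceph quorum_status   # shows the 3 remaining monitors in quorum
       ceph mon stat        # one-line summary of monitor state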

The OSDs do not behave well:
From the logs, each OSD fails its heartbeats to every OSD in the other site. This is expected.

However, the OSDs on the “dead” site are not declared “down”.
Some of them go down and then come back up, but mostly they stay up.

As a result, all PGs are stuck in the “peering” state and the cluster is unusable: no clients can read or write in either site.
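
We are checking this with commands along these lines:

    ceph health detail           # health warnings, including PGs stuck peering
    ceph pg dump_stuck inactive  # the stuck PGs and their acting OSD sets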

Is this expected?
Are there any parameters that can be changed to improve the behavior?
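
For reference, the kinds of settings that seem relevant (as we understand the docs; values shown are what we believe the defaults to be, nothing tuned yet):

    [global]
    # number of OSDs that must report a peer down before the monitor marks it down
    mon osd min down reporters = 1
    # mark an OSD down if it has not reported to the monitors for this long (seconds)
    mon osd report timeout = 900
    # grace period (seconds) before a missed heartbeat turns into a failure report
    osd heartbeat grace = 20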

Thanks

Ilia Sokolinski


