Re: Network failure scenarios

On Fri, 23 Aug 2013, Keith Phua wrote:
> Hi,
> 
> It was mentioned on the devel mailing list that, in a two-network setup, 
> if the cluster network fails, the cluster behaves pretty badly. Ref: 
> http://article.gmane.org/gmane.comp.file-systems.ceph.devel/12285/match=cluster+network+fail
> 
> May I know if this problem still exists in cuttlefish or dumpling?

This is fixed in dumpling.  When an OSD is marked down, it verifies that 
it is able to connect to other hosts on both its public and cluster 
networks before trying to add itself back into the cluster.
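
Conceptually the check is just "can I still reach someone on each of my 
two networks?".  A minimal sketch of that idea in Python (illustration 
only, not Ceph's actual C++ code; the function and parameter names here 
are made up):

# Illustration only -- not Ceph's implementation; names are hypothetical.
# A down-marked OSD probes peers on *both* networks before asking the
# monitors to mark it back up.
def can_rejoin(public_peers, cluster_peers, probe):
    # probe(addr) should return True if addr is reachable (e.g. a TCP connect).
    public_ok = any(probe(addr) for addr in public_peers)
    cluster_ok = any(probe(addr) for addr in cluster_peers)
    return public_ok and cluster_ok

So an OSD whose cluster interface is dead no longer keeps re-adding 
itself just because its public interface still works.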
 
> Suppose I have 2 racks of servers in a cluster and a total of 5 mons. 
> Rack1 contains 3 mons and 120 OSDs, and rack2 contains 2 mons and 120 
> OSDs. In a two-network setup, may I know what will happen when the 
> following problems occur:
> 
> 1. The public network links between rack1 and rack2 fail, so the rack1 
> mons cannot contact the rack2 mons. The OSDs in both racks are still 
> connected. Will the cluster see this as 2 out of 5 mons failed or 3 out 
> of 5 mons failed?

This is a classic partition.  One rack will see 3 working and 2 failed 
mons, and the cluster will appear "up".  The other rack will see 2 working 
and 3 failed mons, and will be effectively down.
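
For reference, the arithmetic is just the usual majority rule: with 5 
mons, quorum needs at least floor(5/2)+1 = 3.  A minimal sketch 
(illustration only, not the monitor code):

# Majority rule for monitor quorum (illustration only).
def has_quorum(reachable_mons, total_mons=5):
    # A partition keeps quorum only with a strict majority of *all* mons,
    # not just of the mons it can see.
    return reachable_mons > total_mons // 2

print(has_quorum(3))  # rack1's side: 3 of 5 reachable -> True, stays up
print(has_quorum(2))  # rack2's side: 2 of 5 reachable -> False, effectively down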

> 2. The cluster network links between rack1 and rack2 fail, so the OSDs 
> in rack1 and the OSDs in rack2 are disconnected from each other, as 
> mentioned above.

Here all the mons are available.  OSDs will get marked down by peers in 
the opposite rack because the cluster network link has failed.  They will 
only try to mark themselves back up if they are able to reach at least 
1/3 of their peers.  That threshold is currently hard-coded; we can 
easily make it tunable (https://github.com/ceph/ceph/pull/533).
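
To make the ratio concrete, here is a rough sketch of that check 
(illustrative Python only; the real heartbeat code and the name of any 
tunable added by that pull request may differ):

# Illustration only; the real code and any tunable it adds may differ.
MIN_HEALTHY_RATIO = 1.0 / 3.0   # the currently hard-coded threshold

def should_mark_back_up(reachable_peers, total_peers, min_ratio=MIN_HEALTHY_RATIO):
    # A down-marked OSD only asks to be marked up again if it can still
    # reach at least min_ratio of its heartbeat peers.
    if total_peers == 0:
        return False
    return reachable_peers / total_peers >= min_ratio

# In the 2x120 OSD example: an OSD in rack1 can still reach its ~119 rack1
# peers out of ~239 total, which is well above 1/3, so it will try to rejoin.
print(should_mark_back_up(119, 239))  # True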

> 3. Both network links between rack1 and rack2 fail, so split-brain seems 
> to occur. Will the cluster halt? Or will rack1 start to self-heal and 
> re-replicate data within rack1, since rack1 will have 3 of the 5 mons 
> working?

This is really the same as 1.  Only the half with a majority of 
communicating monitors will be 'up'; the other part of the cluster will 
not be allowed to do anything.

sage

> In the above scenarios, all links within each rack are working.
> 
> Your valuable comments are greatly appreciated.
> 
> Keith
> 



