Re: Network failure scenarios

On Fri, 23 Aug 2013, Keith Phua wrote:
> 
> 
> ----- Original Message -----
> > From: "Sage Weil" <sage@xxxxxxxxxxx>
> > To: "Keith Phua" <keith@xxxxxxxxxxxxxxxxxx>
> > Cc: ceph-users@xxxxxxxx
> > Sent: Friday, August 23, 2013 12:48:18 PM
> > Subject: Re:  Network failure scenarios
> > 
> > On Fri, 23 Aug 2013, Keith Phua wrote:
> > > Hi,
> > > 
> > > It was mentioned on the devel mailing list that in a two-network setup,
> > > if the cluster network fails, the cluster behaves pretty badly. Ref:
> > > http://article.gmane.org/gmane.comp.file-systems.ceph.devel/12285/match=cluster+network+fail
> > > 
> > > May I know if this problem still exists in cuttlefish or dumpling?
> > 
> > This is fixed in dumpling.  When an osd is marked down, it verifies that
> > it is able to connect to other hosts on both its public and cluster
> > network before trying to add itself back into the cluster.
> >  
> 
> Alright, that's great!
> 
> > > Suppose I have 2 racks of servers in a cluster and a total of 5 mons.
> > > Rack1 contains 3 mons and 120 osds, and rack2 contains 2 mons and 120
> > > osds. In a two-network setup, may I know what will happen when the
> > > following problems occur:
> > > 
> > > 1. The public network links between rack1 and rack2 fail, leaving the
> > > rack1 mons unable to contact the rack2 mons; the osds of both racks are
> > > still connected. Will the cluster see it as 2 out of 5 mons failed or 3
> > > out of 5 mons failed?
> > 
> > This is a classic partition.  One rack will see 3 working and 2 failed
> > mons, and the cluster will appear "up".  The other rack will see 2 working
> > and 3 failed mons, and will be effectively down.
> 
> For this scenario, since only the public network link between the 2 racks 
> is down while the cluster network between the racks is still up, will the 
> cluster treat it as "up" with 2 mons down?  Will rack2 still be 
> effectively down?

The side with the 3 mons will treat it as up, yes.  The public network is 
used to communicate with the mons.  The other rack will be effectively 
offline, since it can't reach monitors in quorum.  The osds in the up rack 
will mark all of the other rack's osds down because the public interface 
isn't reachable (both the front and backside are pinged).
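
For reference, the two-network split we're talking about is the usual 
public/cluster pair in ceph.conf.  A minimal sketch (the subnets below are 
just placeholders, not your actual networks):

 [global]
 # mons, clients and the front-side osd heartbeats live here
 public network = 192.168.1.0/24
 # replication traffic and the back-side osd heartbeats live here
 cluster network = 192.168.2.0/24

On the side that still has a monitor majority, 'ceph quorum_status' will 
list the mons in quorum; on the cut-off side the command will typically 
just hang until it times out, which is itself a pretty good hint.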

> > > 2. The cluster network links between rack1 and rack2 fail, leaving the
> > > osds in rack1 disconnected from the osds in rack2, as mentioned above.
> > 
> > Here all the mons are available.  OSDs will get marked down by peers in
> > the opposite rack because the cluster network link has failed.  They will
> > only try to mark themselves back up if they are able to reach 1/3 of their
> > peers.  This value is currently hard-coded; we can easily make it tunable.
> > (https://github.com/ceph/ceph/pull/533)
> 
> So in this case, if the crushmap is configured to distribute data across 
> hosts, the OSDs that are able to reach their peers within the same rack 
> will stay up, and the peers across the racks will be marked down.  So will 
> the OSDs whose peers are across the racks start to self-heal and 
> replicate within the same rack after some time?

It depends.  There is a tunable

 mon_osd_down_out_subtree_limit = rack

that will prevent the mon from marking things out (triggering healing) 
if an entire piece of the subtree is down.  By default this is set to 
rack, so if an entire rack is down the system won't heal (we assume it is 
a temporary event that an admin will fix).  So by default in your case no 
healing will happen.  You could change this to host, though.
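
For example, in ceph.conf on the mons:

 [mon]
 mon osd down out subtree limit = host

or, as a sketch of the runtime route (check the injectargs syntax against 
your release before trusting it):

 ceph tell mon.* injectargs '--mon-osd-down-out-subtree-limit host'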

> > > 3. Both network links between rack1 and rack2 fail. Split-brain seems
> > > to occur.  Will the cluster halt? Or will rack1 start to self-heal and
> > > replicate data within rack1, since rack1 will have 3 out of 5 mons
> > > working?
> > 
> > This is really the same as 1.  Only the half with a majority of
> > communicating monitors will be 'up'; the other part of the cluster will
> > not be allowed to do anything.
> > 
> 
> Does it also mean the half with a majority of mons up will start to 
> self-heal and replicate the data within that rack after some time, and 
> that if the rack is near full, the 'cluster' will halt?

See above: by default, we don't try to heal when an entire rack is down.
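
If you do want re-replication in that situation, the options are to lower 
the subtree limit as above or to mark the unreachable osds out by hand, 
e.g.

 ceph osd out 121

for each osd id in the failed rack (121 is just an example id, not one of 
yours); marking an osd out is what actually triggers backfill.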

sage

> 
> Thanks Sage!
> 
> > sage
> > 
> > > In the above scenarios, all links within the rack are all working.
> > > 
> > > Your valuable comments are greatly appreciated.
> > > 
> > > Keith
> > > 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



