On Wed, Aug 04, 2004 at 04:06:32PM +0200, Schumacher, Bernd wrote:
> > > The single point of failure is:
> > > The LAN card specified in "nodes.ccs:ip_interfaces" stops working on
> > > one node, no matter whether this node was master or slave.
> > >
> > > The whole GFS is stopped:
> > > The rest of the cluster seems to need time to form a new cluster. The
> > > bad node does not need so much time for switching to Arbitrating
> > > mode. So the bad node has enough time to fence all other nodes
> > > before it would be fenced by the new master.
> > >
> > > The bad node lives, but it can not form a cluster. GFS is not working.
> > >
> > > Now all other nodes will reboot. But after the reboot they can not join
> > > the cluster, because they can not contact the bad node. The
> > > LAN card is still broken. GFS is not working.
> > >
> > > Did I miss something?
> > > Please tell me that I am wrong!
> >
> > Well, I guess I'm confused how the node with the bad LAN card
> > can contact the fencing device to fence the other nodes. If
> > it can't communicate with the other nodes because its NIC is
> > down, it can't contact the fencing device over that NIC
> > either, right? Or are you using some alternate transport to
> > contact the fencing device?
>
> There is a second admin LAN which is used for fencing.
>
> Could I perhaps use this second admin LAN for the GFS heartbeats too?
> Can I define two LAN cards in "nodes.ccs:ip_interfaces"? If this works,
> I would not have a single point of failure anymore. But the
> documentation does not seem to allow this.
> I will test this tomorrow.

GULM does not support multiple ethernet devices. In this case, you would
want to architect your network so that the fence devices are on the same
network as the heartbeats.

However, if you did _NOT_ do that, the problem isn't as bad as you make it
out to be. You're correct in thinking that there will be a shootout. One
of your gulm servers will try to fence the others, and the others will try
to fence the one. When the smoke clears, you will at worst be left with a
single server. If that remaining server can no longer talk to the other
lock_gulmd servers due to a net split, it will continue to sit in the
Arbitrating state waiting for the other nodes to log in. The other nodes,
however, will be able to start a new generation of the cluster when they
restart, because they will be quorate. If the other, quorate part of the
netsplit wins the shootout, you only lose the one node.

If this is not acceptable, then you really need to rethink why the
heartbeats are not going over the same interface as the fencing device.

-Adam

> > > > -----Original Message-----
> > > > From: linux-cluster-bounces@xxxxxxxxxx
> > > > [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of
> > > > Schumacher, Bernd
> > > > Sent: Tuesday, 3 August 2004 13:56
> > > > To: linux-cluster@xxxxxxxxxx
> > > > Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence
> > > >
> > > > Hi,
> > > > I have three nodes oben, mitte and unten.
> > > >
> > > > Test:
> > > > I have disabled eth0 on mitte, so that mitte will be excluded.
> > > >
> > > > Result:
> > > > Oben and unten are trying to fence mitte and build a new
> > > > cluster. OK! But mitte tries to fence oben and unten. PROBLEM!
> > > >
> > > > Why can this happen? Mitte knows that it can not build a cluster.
> > > > See the logfile from mitte: "Have 1, need 2"
> > > >
> > > > Logfile from mitte:
> > > > Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Client (oben) expired
> > > > Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Core lost slave quorum. Have 1, need 2. Switching to Arbitrating.
> > > > Aug 3 12:53:17 mitte lock_gulmd_core[2120]: Gonna exec fence_node oben
> > > > Aug 3 12:53:17 mitte lock_gulmd_core[1845]: Forked [2120] fence_node oben with a 0 pause.
> > > > Aug 3 12:53:17 mitte fence_node[2120]: Performing fence method, manual, on oben.
> > > >
> > > > cluster.ccs:
> > > > cluster {
> > > >   name = "tom"
> > > >   lock_gulm {
> > > >     servers = ["oben", "mitte", "unten"]
> > > >   }
> > > > }
> > > >
> > > > fence.ccs:
> > > > fence_devices {
> > > >   manual_oben {
> > > >     agent = "fence_manual"
> > > >   }
> > > >   manual_mitte ...
> > > >
> > > > nodes.ccs:
> > > > nodes {
> > > >   oben {
> > > >     ip_interfaces {
> > > >       eth0 = "192.168.100.241"
> > > >     }
> > > >     fence {
> > > >       manual {
> > > >         manual_oben {
> > > >           ipaddr = "192.168.100.241"
> > > >         }
> > > >       }
> > > >     }
> > > >   }
> > > >   mitte ...
> > > >
> > > > regards
> > > > Bernd Schumacher
> >
> > --
> > AJ Lewis                                  Voice:  612-638-0500
> > Red Hat Inc.                              E-Mail: alewis@xxxxxxxxxx
> > 720 Washington Ave. SE, Suite 200
> > Minneapolis, MN 55414
> >
> > Current GPG fingerprint = D9F8 EDCE 4242 855F A03D 9B63 F50C 54A8 578C 8715
> > Grab the key at: http://people.redhat.com/alewis/gpg.html or one of the
> > many keyservers out there...
> >
> > --
> > Linux-cluster@xxxxxxxxxx
> > http://www.redhat.com/mailman/listinfo/linux-cluster

--
Adam Manthei  <amanthei@xxxxxxxxxx>
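
For anyone puzzling over the "Have 1, need 2" message in the logfile above:
the rule at work is an ordinary majority quorum over the lock_gulmd servers
listed in cluster.ccs. A minimal illustrative sketch of that arithmetic, in
plain Python rather than anything from the GULM code base (the function name
is made up for illustration):

    # Illustrative only: a majority-quorum check like the one implied by
    # "Have 1, need 2". 'servers' mirrors lock_gulm.servers in cluster.ccs.
    def has_quorum(servers, reachable):
        needed = len(servers) // 2 + 1      # 3 configured servers -> need 2
        return len(reachable) >= needed

    servers = ["oben", "mitte", "unten"]

    print(has_quorum(servers, ["mitte"]))           # False: "Have 1, need 2"
    print(has_quorum(servers, ["oben", "unten"]))   # True: oben and unten are quorate

With the three servers from cluster.ccs, a node that can only reach itself can
never become quorate on its own, which is why mitte sits in Arbitrating while
oben and unten can go on to form a new cluster generation.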