RE: [Linux-cluster] GFS 6.0 node without quorum tries to fence

> -----Original Message-----
> From: linux-cluster-bounces@xxxxxxxxxx 
> [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of AJ Lewis
> Sent: Wednesday, 4 August 2004 15:54
> To: Discussion of clustering software components including GFS
> Subject: Re: [Linux-cluster] GFS 6.0 node without quorum tries to fence
> 
> 
> On Wed, Aug 04, 2004 at 08:12:51AM +0200, Schumacher, Bernd wrote:
> > So, what I have learned from all the answers is very bad news for me.
> > It seems that what happened is what most of you expected. But this
> > means:
> > 
> > ------------------------------------------------------------------------
> > --- One single point of failure in one node can stop the whole GFS. ---
> > ------------------------------------------------------------------------
> > 
> > The single point of failure is:
> > The LAN card specified in "nodes.ccs:ip_interfaces" stops working on
> > one node. It does not matter whether this node was master or slave.
> > 
> > How the whole GFS is stopped:
> > The rest of the cluster seems to need some time to form a new cluster.
> > The bad node does not need as much time to switch to Arbitrating mode,
> > so the bad node has enough time to fence all the other nodes before it
> > would be fenced by the new master.
> > 
> > The bad node stays up, but it cannot form a cluster on its own. GFS is
> > not working.
> > 
> > Now all the other nodes reboot. But after the reboot they cannot join
> > the cluster, because they cannot contact the bad node. The LAN card is
> > still broken. GFS is not working.
> > 
> > Did I miss something?
> > Please tell me that I am wrong!
> 
> Well, I guess I'm confused about how the node with the bad LAN card
> can contact the fencing device to fence the other nodes.  If it can't
> communicate with the other nodes because its NIC is down, it can't
> contact the fencing device over that NIC either, right?  Or are you
> using some alternate transport to contact the fencing device?

There is a second admin LAN which is used for fencing.
 
Could I perhaps use this second admin LAN for the GFS heartbeats too?
Can I define two LAN cards in "nodes.ccs:ip_interfaces"? If this works,
I would no longer have a single point of failure. But the documentation
does not seem to allow this.
I will test this tomorrow.
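
To make clear what I mean, here is a minimal sketch of the nodes.ccs
entry I would try for oben, with a second interface on the admin LAN
added next to eth0. The address 192.168.200.241 for eth1 is only an
assumed example, and whether lock_gulm accepts more than one entry in
"ip_interfaces" at all is exactly what I still have to find out:

nodes {
  oben {
    ip_interfaces {
      eth0 = "192.168.100.241"
      eth1 = "192.168.200.241"
    }
    fence {
      manual {
        manual_oben {
          ipaddr = "192.168.100.241"
        }
      }
    }
  }
  mitte ...

If a second entry is accepted, losing one of the two cards should no
longer take down the whole cluster. I will report what the test shows.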

>  
> > > -----Original Message-----
> > > From: linux-cluster-bounces@xxxxxxxxxx
> > > [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of 
> > > Schumacher, Bernd
> > > Sent: Tuesday, 3 August 2004 13:56
> > > To: linux-cluster@xxxxxxxxxx
> > > Subject: [Linux-cluster] GFS 6.0 node without quorum tries to fence
> > > 
> > > 
> > > Hi,
> > > I have three nodes oben, mitte and unten.
> > > 
> > > Test:
> > > I have disabled eth0 on mitte, so that mitte will be excluded.
> > > 
> > > Result:
> > > Oben and unten are trying to fence mitte and build a new
> > > cluster. OK! But mitte tries to fence oben and unten. PROBLEM!
> > >  
> > > Why can this happen? Mitte knows that it can not build a
> > > cluster. See Logfile from mitte: "Have 1, need 2"
> > > 
> > > Logfile from mitte:
> > > Aug  3 12:53:17 mitte lock_gulmd_core[1845]: Client (oben) expired
> > > Aug  3 12:53:17 mitte lock_gulmd_core[1845]: Core lost slave quorum. Have 1, need 2. Switching to Arbitrating.
> > > Aug  3 12:53:17 mitte lock_gulmd_core[2120]: Gonna exec fence_node oben
> > > Aug  3 12:53:17 mitte lock_gulmd_core[1845]: Forked [2120] fence_node oben with a 0 pause.
> > > Aug  3 12:53:17 mitte fence_node[2120]: Performing fence method, manual, on oben.
> > > 
> > > cluster.ccs:
> > > cluster {
> > >     name = "tom"
> > >     lock_gulm {
> > >         servers = ["oben", "mitte", "unten"]
> > >     }
> > > }
> > > 
> > > fence.ccs:
> > > fence_devices {
> > >   manual_oben {
> > >     agent = "fence_manual"
> > >   }     
> > >   manual_mitte ...
> > > 
> > > 
> > > nodes.ccs:
> > > nodes {
> > >   oben {
> > >     ip_interfaces {
> > >       eth0 = "192.168.100.241"
> > >     }
> > >     fence { 
> > >       manual {
> > >         manual_oben {
> > >           ipaddr = "192.168.100.241"
> > >         }
> > >       }
> > >     }
> > >   }
> > >   mitte ...
> > > 
> > > regards
> > > Bernd Schumacher
> > > 
> 
> -- 
> AJ Lewis                                   Voice:  612-638-0500
> Red Hat Inc.                               E-Mail: alewis@xxxxxxxxxx
> 720 Washington Ave. SE, Suite 200
> Minneapolis, MN 55414
> 
> Current GPG fingerprint = D9F8 EDCE 4242 855F A03D  9B63 F50C 54A8 578C 8715
> Grab the key at: http://people.redhat.com/alewis/gpg.html or one of the
> many keyservers out there...
> -----Begin Obligatory Humorous Quote----------------------------------------
> "In this time of war against Osama bin Laden and the oppressive Taliban
> regime, we are thankful that OUR leader isn't the spoiled son of a
> powerful politician from a wealthy oil family who is supported by
> religious fundamentalists, operates through clandestine organizations,
> has no respect for the democratic electoral process, bombs innocents,
> and uses war to deny people their civil liberties." --The Boondocks
> -----End Obligatory Humorous Quote------------------------------------------
> 

