Re: [Linux-cluster] 10 Node Installation - Losing Heartbeat

On Thu, Feb 03, 2005 at 02:53:08PM +0200, Richard Mayhew wrote:
> The problem I am experiencing is as follows.
>  
> Once the GFS system has been running for a few hours with some usage
> on each of the servers, some of the servers start missing beats. I
> increased the heartbeat rate to test every 60 seconds and to fail
> after 10 misses. This just prolonged the time before the servers were
> fenced. The only thing I can come up with is that the locking server
> is buggy and stops responding to heartbeats. When the master server
> detects that a server has skipped the required number of beats, it
> tries to fence it and fails. I have set up the fencing to use the
> mcdata module and I have specified the correct login details. When
> the fenced server's lock server has been restarted, it tries to log
> back in to the master lock server. This fails for obvious reasons, as
> the master will refuse to allow it to reconnect due to the previous
> fencing failures. Manual fencing works without any problems, but I
> have only tried this on the command line.
>  
> Does anyone have an idea as to why the locking servers are hanging up
> when it comes to sending heartbeats, and possibly why the fencing
> isn't working?

First, test the mcdata fencing.  Get your configs installed and ccsd
running.  Then run `fence_node store-01.mc.mweb.net` (or one of the
other nodes).  Make sure that this has in fact worked.  (Look on the
switch and so on.  I don't know anything about the mcdata, so I cannot
say much more here.)

This will let you test and see how fencing is working.  

Don't continue until you can call fence_node for every node and it does
in fact fence the node.
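
A rough sketch of making that per-node check systematic (the hostnames
just follow the store-NN.mc.mweb.net pattern above, so substitute your
real node list; and remember fence_node really does cut the node off,
so only run this against nodes you can afford to take down):

  #!/bin/sh
  # Hypothetical loop over cluster members; replace the list below
  # with your actual node names.
  for n in store-01 store-02 store-03; do
      if fence_node "${n}.mc.mweb.net"; then
          echo "fenced ${n} OK -- verify on the mcdata switch too"
      else
          echo "fencing FAILED for ${n}"
      fi
  done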

So, on to the missed heartbeats.  First, what is 'some usage'?  Not
that it should matter much, but just as an example, I can get missed
heartbeats by syncing large (~1G or so) amounts of data to the internal
drive.  (But to miss 11 at 60s apiece, it's probably not this.)
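
If you want to reproduce that kind of load for comparison, a minimal
sketch (the path and size here are arbitrary):

  # Dirty ~1G of page cache and force it to disk; on a busy box this
  # can stall things long enough to drop a beat or two.
  dd if=/dev/zero of=/var/tmp/hb-test.img bs=1M count=1024
  sync
  rm /var/tmp/hb-test.img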

Also, are there any messages from the Master or the Clients around the
time they start missing heartbeats (other than the missed heartbeat
messages themselves)?  If so, they might give some clues as to what's
happening.
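
For example, something like this pulls the gulm chatter out of syslog
(assuming logs land in the default /var/log/messages):

  # lock_gulmd is the gulm daemon; adjust the path if your syslog
  # is configured differently.
  grep lock_gulmd /var/log/messages | tail -50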

The best thing to do when debugging heartbeats is to turn on those
messages.  So add cluster.ccs:cluster{ lock_gulm { verbosity =
"Default,Heartbeat" } } to your config, and run things again.  (You
might also want to turn the heartbeat rate back down for this.)  There
will now be a message in syslog for every heartbeat sent and received.
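
Spelled out, that shorthand corresponds to a stanza roughly like this
in cluster.ccs (the cluster name, server list, and rate/miss values
below are placeholders, not recommendations):

  cluster {
      name = "mycluster"
      lock_gulm {
          servers = ["store-01.mc.mweb.net"]   # your lock server(s)
          verbosity = "Default,Heartbeat"      # log every beat
          heartbeat_rate = 15.0                # seconds between beats
          allowed_misses = 2                   # misses before expiry
      }
  }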

Hopefully this will unveil something.
-- 
Michael Conrad Tadpol Tilstra
To understand recursion, you must first understand recursion.
