Re: [Linux-cluster] node kicked out of cluster

Daniel McNeil <daniel@xxxxxxxx> · Wed, 23 Feb 2005 15:12:22 -0800

On Tue, 2005-02-22 at 00:35, Patrick Caulfield wrote:
> On Mon, Feb 21, 2005 at 05:34:23PM -0800, Daniel McNeil wrote:
> > My latest test ran 49 hours before a node got kicked out.
> > 
> > 
> > cl030:
> > Feb 18 18:07:40 cl030 kernel: CMAN: node cl030a has been removed from the cluster : No response to messages
> > Feb 18 18:07:40 cl030 kernel: CMAN: killed by NODEDOWN message
> > Feb 18 18:07:40 cl030 kernel: CMAN: we are leaving the cluster.
> > Feb 18 18:07:41 cl030 kernel: dlm: stripefs: recoverd_kick after exit
> > Feb 18 18:07:41 cl030 kernel:
> > Feb 18 18:07:41 cl030 kernel: SM: send_nodeid_message error -107 to 2
> > Feb 18 18:07:42 cl030 kernel: SM: 00000001 sm_stop: SG still joined
> > Feb 18 18:07:42 cl030 kernel: SM: 01000430 sm_stop: SG still joined
> > Feb 18 18:07:42 cl030 kernel: SM: 02000431 sm_stop: SG still joined
> > Feb 18 18:07:42 cl030 ccsd[3766]: [cluster_mgr.c:387] Cluster manager shutdown.
> > 
> > cl031:
> > Feb 18 18:07:40 cl031 kernel: CMAN: removing node cl030a from the cluster : No response to messages
> > Feb 18 18:07:41 cl031 fenced[4127]: cl030a not a cluster member after 0 sec post_fail_delay
> > Feb 18 18:07:41 cl031 fenced[4127]: fencing node "cl030a"
> > Feb 18 18:07:41 cl031 fence_manual: Node cl030a needs to be reset before recovery can procede.  Waiting for cl030a to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n cl030a)
> > 
> > cl032:
> > Feb 18 18:07:40 cl032 kernel: CMAN: node cl030a has been removed from the cluster : No response to messages
> > Feb 18 18:07:41 cl032 fenced[4262]: fencing deferred to cl031a
> > Feb 19 04:02:06 cl032 su(pam_unix)[29639]: session opened for user cyrus by (uid=0)
> > 
> > Does this mean heartbeats got lost so cl030 was kicked out?
> 
> No. "No response to messages" can only happen during a state transition or
> services join/leave. Current thinking is that the DLM can hog the CPU when
> recovering huge numbers of locks, so we are looking into placing some strategic
> "schedule()" calls in the recovery process.

My test does a lot of mounts and umounts, which is what causes
the service join/leave for the DLM lock space and the file system
mount group.

Are you saying that leaving a DLM lock space causes a DLM recovery, and
that is what leads to the 'No response to messages'?

How does the DLM hogging a CPU lead to 'No response'?
BTW, this is running on 2-processor machines, so hogging one CPU
still leaves one available.

Who is not responding to what message?

Thanks,

Daniel