Never mind. This was all due to incorrect time on a couple of the nodes.
One node's clock was in the past, and one was in the future.

It may be worth fixing this, as it DOES cause a kernel panic. Maybe add
some kind of time sync check that disallows a node from joining when its
time isn't within X of the cluster.
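
A minimal sketch of such a check, in Python, purely to illustrate the idea. It is not how CMAN behaves and is not part of any cluster tool: the function name, the use of the median as the reference clock, and the sample timestamps are all assumptions made for the example.

```python
import statistics


def nodes_outside_skew(node_times, max_skew_seconds):
    """Return {node: offset} for nodes whose clock is too far from the cluster.

    node_times maps a node name to its clock reading in seconds since the
    epoch. The cluster reference is taken to be the median reading
    (an assumption chosen for this sketch, so that one fast and one slow
    node do not drag the reference with them).
    """
    reference = statistics.median(node_times.values())
    return {
        name: reading - reference
        for name, reading in node_times.items()
        if abs(reading - reference) > max_skew_seconds
    }


def may_join(node_time, cluster_time, max_skew_seconds):
    """Hypothetical join gate: allow the join only if the skew is within X."""
    return abs(node_time - cluster_time) <= max_skew_seconds


# Hypothetical readings: one node in the past, one in the future.
sample = {"node_a": 1000.0, "node_b": 1001.0, "node_past": 700.0, "node_future": 1300.0}
offenders = nodes_outside_skew(sample, max_skew_seconds=5.0)
```

Here `offenders` would hold the two nodes whose clocks disagree with the rest, and a join request from either would be refused by `may_join` until its time is corrected.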
Robert Gil
Linux Systems Administrator
American Home Mortgage
Phone: 631-622-8410
Cell: 631-827-5775
Fax: 516-495-5861
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Robert Gil
Sent: Tuesday, May 22, 2007 11:49 AM
To: linux-cluster@xxxxxxxxxx
Subject: Strange Behavior
I am getting some strange behavior on a 4 node cluster. When node dbs2
tries to connect to the cluster, node app3 either kernel panics or ccsd
and rgmanager crash. Node dbs2 says that the heartbeats drop off, and it
goes to remove itself from the cluster. I am curious why node app3 would
crash, and what these SM messages are. Also, why would node dbs2 connect
to the cluster, become quorate, and then drop off and crash node 1? Has
anyone seen this before?
/var/log/messages
May 22 11:34:36 melqsjssapp03 kernel: CMAN: node melqsjssdbs02.americanhm.com rejoining
May 22 11:35:11 melqsjssapp03 kernel: CMAN: node melqsjssdbs02.americanhm.com has been removed from the cluster : Missed too many heartbeats
May 22 11:35:25 melqsjssapp03 kernel: CMAN: node melqsjssapp03.americanhm.com has been removed from the cluster : No response to messages
May 22 11:35:25 melqsjssapp03 kernel: CMAN: killed by NODEDOWN message
May 22 11:35:25 melqsjssapp03 kernel: CMAN: we are leaving the cluster. No response to messages
May 22 11:35:25 melqsjssapp03 kernel: WARNING: dlm_emergency_shutdown
May 22 11:35:25 melqsjssapp03 kernel: WARNING: dlm_emergency_shutdown
May 22 11:35:25 melqsjssapp03 kernel: SM: 00000011 sm_stop: SG still joined
May 22 11:35:25 melqsjssapp03 kernel: SM: 01000014 sm_stop: SG still joined
May 22 11:35:25 melqsjssapp03 kernel: SM: 0200001a sm_stop: SG still joined
May 22 11:35:25 melqsjssapp03 kernel: SM: 03000002 sm_stop: SG still joined
May 22 11:35:25 melqsjssapp03 clurgmgrd[5179]: <warning> #67: Shutting down uncleanly
May 22 11:35:25 melqsjssapp03 ccsd[4630]: Cluster manager shutdown. Attemping to reconnect...
May 22 11:35:51 melqsjssapp03 ccsd[4630]: Unable to connect to cluster infrastructure after 30 seconds.
May 22 11:36:21 melqsjssapp03 ccsd[4630]: Unable to connect to cluster infrastructure after 60 seconds.
Thanks,
Robert Gil
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster