On Thu, Jul 21, 2005 at 11:51:21PM -0400, Dan B. Phung wrote:
> My cluster went down pretty hard, in that I had to hard reboot several
> machines, and now the fence daemon won't come up. I run:
>
> $ ccsd && cman_tool join -w
> $ fence_tool join -w -j 15 -D
> blade02:~ # fence_tool join -w -D -j 15
> fence_tool: wait for quorum 1
> fence_tool: get our node name
> fence_tool: connect to ccs
> fence_tool: start fenced
> fenced: 1122003465 our name from cman "blade02"

This is inconsistent with the data below, which shows that blade01 is a
cluster member, not blade02. Maybe you collected the other data before
blade02 joined the cluster...

> blade13:~ # cman_tool nodes
> Node  Votes Exp Sts  Name
>    1    1    1   M   blade01
>    2    1    1   X   blade02
>    3    1    1   X   blade03
>    4    1    1   X   blade04
>    6    1    1   X   blade06
>    7    1    1   X   blade07
>    8    1    1   X   blade08
>    9    1    1   X   blade09
>   10    1    1   X   blade10
>   11    1    1   X   blade11
>   12    1    1   X   blade12
>   13    1    1   M   blade13
>   14    1    1   X   blade14
>
> blade13:~ # cman_tool status
> Protocol version: 5.0.1
> Config version: 1
> Cluster name: blade_cluster
> Cluster ID: 38068
> Cluster Member: Yes
> Membership state: Cluster-Member
> Nodes: 2
> Expected_votes: 1
> Total_votes: 2
> Quorum: 2
> Active subsystems: 6
> Node name: blade13
>
> blade13:~ # cman_tool services
> Service          Name                              GID LID State     Code
> Fence Domain:    "default"                           1   2 recover 2 -
> [13]

This looks like blade13 is trying to fence some node. blade13 won't let
anyone else join the fence domain until it has completed the fencing; this
is probably why fenced on blade02 isn't getting anywhere.
/var/log/messages on blade13 should show whether, and where, there's an
incomplete fencing operation.

Dave
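
A rough sketch of the two checks discussed above, in case it helps. The
function names members and fence_log are made up for illustration and are
not part of cman or fenced. The first assumes the five-column
"cman_tool nodes" layout quoted in this thread (Node, Votes, Exp, Sts,
Name); the second simply greps a syslog file for the word "fence", since
the exact wording of fenced's log messages is not shown here.

```shell
# Hypothetical helper: read `cman_tool nodes` output on stdin and print
# the names of nodes whose Sts column is "M" (current cluster members).
# Assumes the column order shown in this thread: Node Votes Exp Sts Name.
members() {
    awk '$1 ~ /^[0-9]+$/ && $4 == "M" { print $5 }'
}

# Hypothetical helper: print the last 20 lines mentioning "fence"
# (case-insensitive) from a syslog file, defaulting to /var/log/messages.
fence_log() {
    grep -i 'fence' "${1:-/var/log/messages}" | tail -n 20
}
```

On blade13 this could be used as "cman_tool nodes | members" to confirm
which nodes the cluster currently counts as members, and "fence_log" to
look for a fencing operation that never finished.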