On 22, Jul, 2005, David Teigland declared:
> On Thu, Jul 21, 2005 at 11:51:21PM -0400, Dan B. Phung wrote:
> > My cluster went down pretty hard, in that I had to hard reboot several
> > machines, and now the fence daemon won't come up. I run:
> >
> > $ ccsd && cman_tool join -w
> > $ fence_tool join -w -j 15 -D
> > blade02:~ # fence_tool join -w -D -j 15
> > fence_tool: wait for quorum 1
> > fence_tool: get our node name
> > fence_tool: connect to ccs
> > fence_tool: start fenced
> > fenced: 1122003465 our name from cman "blade02"
>
> This is inconsistent with the data below which shows that blade1 is a
> cluster member, not blade2. Maybe you collected the other data before
> blade2 joined the cluster...

Right, actually I exited from the fence operation and forced blade02 to
leave the cluster.

> This looks like blade13 is trying to fence some node. blade13 won't let
> anyone else join the fence domain until it's completed the fencing; this
> is probably why fenced on blade02 isn't getting anywhere.
> /var/log/messages on blade13 should show where or if there's an incomplete
> fencing operation.

Here are some excerpts from /var/log/messages on blade13:

Jul 21 16:48:05 blade13 kernel: qla2300 0000:02:02.0: LOOP DOWN detected.
Jul 21 16:48:37 blade13 kernel: SCSI error : <0 0 1 1> return code = 0x10000
Jul 21 16:48:37 blade13 kernel: end_request: I/O error, dev sdb, sector 69569288
Jul 21 16:48:37 blade13 kernel: SCSI error : <0 0 1 1> return code = 0x10000
Jul 21 16:48:37 blade13 kernel: end_request: I/O error, dev sdb, sector 69569296
Jul 21 16:48:37 blade13 kernel: SCSI error : <0 0 1 1> return code = 0x10000
Jul 21 16:48:37 blade13 kernel: end_request: I/O error, dev sdb, sector 69569304
Jul 21 16:48:37 blade13 kernel: SCSI error : <0 0 1 1> return code = 0x10000
Jul 21 16:48:37 blade13 kernel: end_request: I/O error, dev sdb, sector 69569312
Jul 21 16:48:37 blade13 kernel: SCSI error : <0 0 1 1> return code = 0x10000
Jul 21 16:48:37 blade13 kernel: end_request: I/O error, dev sdb, sector 69569320
Jul 21 16:48:37 blade13 kernel: SCSI error : <0 0 1 1> return code = 0x10000
Jul 21 16:48:37 blade13 kernel: end_request: I/O error, dev sdb, sector 69569328
Jul 21 16:48:37 blade13 kernel: SCSI error : <0 0 1 1> return code = 0x10000
Jul 21 16:48:37 blade13 kernel: end_request: I/O error, dev sdb, sector 69569336
Jul 21 16:48:37 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: fatal: I/O error
Jul 21 16:48:37 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: block = 8696119
Jul 21 16:48:37 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: function = gfs_logbh_wait
Jul 21 16:48:37 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: file = /usr/local/src/cluster-2.6.8.1/gfs-kernel/src/gfs/dio.c, line = 923
Jul 21 16:48:37 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: time = 1121978916
Jul 21 16:48:37 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: about to withdraw from the cluster
Jul 21 16:48:37 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: waiting for outstanding I/O
Jul 21 16:48:37 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: telling LM to withdraw
Jul 21 16:48:37 blade13 kernel: lock_dlm: withdraw abandoned memory
Jul 21 16:48:37 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: withdrawn
Jul 21 16:49:33 blade13 kernel: qla2300 0000:02:02.0: LOOP UP detected (2 Gbps).
Jul 21 17:01:34 blade13 shutdown[7987]: shutting down for system reboot

-- snipped reboot messages --

Jul 21 17:04:17 blade13 kernel: CMAN: Waiting to join or form a Linux-cluster
Jul 21 17:04:20 blade13 kernel: CMAN: sending membership request
Jul 21 17:04:21 blade13 kernel: CMAN: got node blade12
Jul 21 17:04:21 blade13 kernel: CMAN: got node blade04
Jul 21 17:04:21 blade13 kernel: CMAN: got node blade09
Jul 21 17:04:21 blade13 kernel: CMAN: got node blade03
Jul 21 17:04:21 blade13 kernel: CMAN: got node blade02
Jul 21 17:04:21 blade13 kernel: CMAN: got node blade06
Jul 21 17:04:21 blade13 kernel: CMAN: got node blade07
Jul 21 17:04:21 blade13 kernel: CMAN: got node blade08
Jul 21 17:04:21 blade13 kernel: CMAN: got node blade11
Jul 21 17:04:21 blade13 kernel: CMAN: got node blade01
Jul 21 17:04:24 blade13 clvmd: Cluster LVM daemon started - connected to CMAN
Jul 21 17:04:24 blade13 kernel: CMAN: WARNING no listener for port 11 on node blade01
Jul 21 17:18:16 blade13 kernel: GFS: Trying to join cluster "lock_dlm", "blade_cluster:lil_cheesy1_lv"
Jul 21 17:18:18 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: Joined cluster. Now mounting FS...
Jul 21 17:18:18 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: jid=0: Trying to acquire journal lock...
Jul 21 17:18:18 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: jid=0: Looking at journal...
Jul 21 17:18:18 blade13 kernel: GFS: fsid=blade_cluster:lil_cheesy1_lv.0: jid=0: Done
(last message repeated 13 times)
Jul 21 23:14:57 blade13 kernel: CMAN: node blade04 rejoining
Jul 21 23:16:52 blade13 kernel: CMAN: node blade12 rejoining
Jul 21 23:21:16 blade13 kernel: CMAN: node blade12 has been removed from the cluster : Shutdown
Jul 21 23:23:02 blade13 kernel: CMAN: node blade02 has been removed from the cluster : Missed too many heartbeats
Jul 21 23:23:03 blade13 kernel: SM: 00000001 process_recovery_barrier status=-104
Jul 21 23:23:27 blade13 kernel: CMAN: node blade03 has been removed from the cluster : Missed too many heartbeats
Jul 21 23:23:28 blade13 kernel: SM: 00000001 process_recovery_barrier status=-104
Jul 21 23:24:12 blade13 kernel: CMAN: node blade06 has been removed from the cluster : Missed too many heartbeats
Jul 21 23:24:13 blade13 kernel: SM: 00000001 process_recovery_barrier status=-104
Jul 21 23:24:33 blade13 kernel: CMAN: node blade09 has been removed from the cluster : No response to messages
Jul 21 23:24:43 blade13 kernel: CMAN: removing node blade08 from the cluster : No response to messages
Jul 21 23:24:43 blade13 kernel: CMAN: removing node blade07 from the cluster : No response to messages
Jul 21 23:24:53 blade13 kernel: SM: 00000001 process_recovery_barrier status=-104

> > blade13:~ # cman_tool nodes
> > Node  Votes Exp Sts  Name
> >    1    1    1   M   blade01
> >    2    1    1   X   blade02
> >    3    1    1   X   blade03
> >    4    1    1   X   blade04
> >    6    1    1   X   blade06
> >    7    1    1   X   blade07
> >    8    1    1   X   blade08
> >    9    1    1   X   blade09
> >   10    1    1   X   blade10
> >   11    1    1   X   blade11
> >   12    1    1   X   blade12
> >   13    1    1   M   blade13
> >   14    1    1   X   blade14
> >
> > blade13:~ # cman_tool status
> > Protocol version: 5.0.1
> > Config version: 1
> > Cluster name: blade_cluster
> > Cluster ID: 38068
> > Cluster Member: Yes
> > Membership state: Cluster-Member
> > Nodes: 2
> > Expected_votes: 1
> > Total_votes: 2
> > Quorum: 2
> > Active subsystems: 6
> > Node name: blade13
> >
> > blade13:~ # cman_tool services
> > Service          Name                              GID LID State     Code
> > Fence Domain:    "default"                           1   2 recover 2 -
> > [13]
> >

--
Linux-cluster@xxxxxxxxxx
http://www.redhat.com/mailman/listinfo/linux-cluster
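
The membership changes in the /var/log/messages excerpts above can be pulled out mechanically rather than by eye. What follows is only a rough sketch: the function name `removed_nodes` and the regular expression are my own assumptions, written against the two message phrasings that appear in these excerpts and nothing more. It lists each node CMAN removed together with the stated reason. My guess, which the excerpts do not confirm, is that nodes removed for a reason other than a clean "Shutdown" are the likely candidates for the fence operation that blade13 has not finished.

```python
import re

# Assumption: only these two phrasings occur, as in the excerpts above:
#   "CMAN: node blade02 has been removed from the cluster : Missed too many heartbeats"
#   "CMAN: removing node blade08 from the cluster : No response to messages"
REMOVED = re.compile(
    r"CMAN: (?:node (?P<a>\S+) has been removed|removing node (?P<b>\S+)) "
    r"from the cluster : (?P<reason>.+)$"
)


def removed_nodes(log_text):
    """Return (node, reason) pairs for every CMAN removal line, in log order."""
    found = []
    for line in log_text.splitlines():
        m = REMOVED.search(line)
        if m:
            node = m.group("a") or m.group("b")
            found.append((node, m.group("reason").strip()))
    return found


# A few lines copied from the excerpts above, used as sample input.
SAMPLE = """\
Jul 21 23:21:16 blade13 kernel: CMAN: node blade12 has been removed from the cluster : Shutdown
Jul 21 23:23:02 blade13 kernel: CMAN: node blade02 has been removed from the cluster : Missed too many heartbeats
Jul 21 23:23:03 blade13 kernel: SM: 00000001 process_recovery_barrier status=-104
Jul 21 23:24:43 blade13 kernel: CMAN: removing node blade08 from the cluster : No response to messages
"""

for node, reason in removed_nodes(SAMPLE):
    print(f"{node}: {reason}")
```

To run it against the real log, read the file contents (for example `open("/var/log/messages").read()`) and pass the text to `removed_nodes` in place of `SAMPLE`.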