The story continues...

On Sun, Jul 01, 2007 at 02:30:40PM +0300, Janne Peltonen wrote:
> > Sometimes, when I have cleanly shut down rgmanager on one node, and the
> > services have nicely migrated to other nodes, trying to start rgmanager
> > fails. Trying to access /dev/misc/dlm_rgmanager results in "No such
> > device". clurgmgrd concludes that locks are not working and exits.
> > (See strace output attached.)
> Interesting. After the one node with failing rgmanagers was shot in the
> head (there were no log lines about fencing, only two about deferring
> fencing to an earlier node), the fenced node was left in 'off' state, and,
> well, the other nodes had their services left running (but rgmanagers
> apparently stuck - no more status checks and no response to the clustat
> command).

Now, the cluster node whose fencing resulted in a stuck system came up
and joined the cluster.

[jmmpelto@pcn1 ~]$ sudo cman_tool services
type             level name     id       state
fence            0     default  00000000 JOIN_STOP_WAIT
[1 2 3 4 100]
dlm              1     clvmd    00000000 JOIN_STOP_WAIT
[1 2 3 4 100]

[jmmpelto@pcn1 ~]$ sudo cman_tool status
Version: 6.0.1
Config Version: 40
Cluster Name: mappi-primary
Cluster Id: 11929
Cluster Member: Yes
Cluster Generation: 184
Membership state: Cluster-Member
Nodes: 5
Expected votes: 5
Total votes: 5
Quorum: 3
Active subsystems: 8
Flags:
Ports Bound: 0 11
Node name: pcn1-hb
Node ID: 1
Multicast addresses: 239.192.46.199
Node addresses: 10.3.0.11

I killed the completely stuck pcn2-hb from there:

[jmmpelto@pcn1 ~]$ sudo cman_tool kill -n pcn2-hb

Log:

Jul 1 14:36:36 pcn2.mappi.helsinki.fi dlm_controld[4577]: cluster is down, exiting
Jul 1 14:36:36 pcn2.mappi.helsinki.fi gfs_controld[4583]: cluster is down, exiting
Jul 1 14:36:36 pcn2.mappi.helsinki.fi fenced[4571]: cluster is down, exiting
Jul 1 14:36:59 pcn2.mappi.helsinki.fi ccsd[4508]: Unable to connect to cluster infrastructure after 30 seconds.

Thereafter, node pcn3-hb fenced it, this time with log entries:

Jul 1 14:36:50 pcn3.mappi.helsinki.fi fenced[4371]: pcn2-hb not a cluster member after 0 sec post_fail_delay
Jul 1 14:36:50 pcn3.mappi.helsinki.fi fenced[4371]: pcn1-hb not a cluster member after 0 sec post_fail_delay
Jul 1 14:36:50 pcn3.mappi.helsinki.fi fenced[4371]: fencing node "pcn2-hb"
Jul 1 14:38:08 pcn3.mappi.helsinki.fi fenced[4371]: fence "pcn2-hb" success
Jul 1 14:38:13 pcn3.mappi.helsinki.fi ccsd[4308]: Attempt to close an unopened CCS descriptor (3012450).
Jul 1 14:38:13 pcn3.mappi.helsinki.fi ccsd[4308]: Error while processing disconnect: Invalid request descriptor

But nobody tried to fence pcn1-hb (see the second log line). Apparently,
though, pcn3-hb tried to say something to pcn1-hb:

Jul 1 14:38:13 pcn1.mappi.helsinki.fi fenced[4461]: fencing deferred to prior member
Jul 1 14:38:13 pcn1.mappi.helsinki.fi dlm_controld[4467]: open "/sys/kernel/dlm/rgmanager/id" error -1 2
Jul 1 14:38:13 pcn1.mappi.helsinki.fi dlm_controld[4467]: open "/sys/kernel/dlm/rgmanager/control" error -1 2
Jul 1 14:38:13 pcn1.mappi.helsinki.fi dlm_controld[4467]: open "/sys/kernel/dlm/rgmanager/event_done" error -1 2

This time the services are in no specific state, but the rgmanager still
does nothing constructive:

[jmmpelto@pcn3 ~]$ sudo cman_tool services
Password:
type             level name       id       state
fence            0     default    00010001 none
[1 3 4 100]
dlm              1     clvmd      00010002 none
[1 3 4 100]
dlm              1     rgmanager  00020002 none
[1 3 4]

[jmmpelto@pcn3 ~]$ sudo clustat
Timed out waiting for a response from Resource Group Manager
Member Status: Quorate

  Member Name    ID   Status
  ------ ----    ---- ------
  pcnm-hb        100  Online
  pcn1-hb        1    Online
  pcn2-hb        2    Offline
  pcn3-hb        3    Online, Local
  pcn4-hb        4    Online

On node pcn1-hb:

[jmmpelto@pcn1 ~]$ sudo cman_tool services
type             level name       id       state
fence            0     default    00010001 none
[1 3 4 100]
dlm              1     clvmd      00010002 none
[1 3 4 100]
dlm              1     rgmanager  00020002 none
[1 3 4]
[jmmpelto@pcn1 ~]$
[jmmpelto@pcn1 ~]$
[jmmpelto@pcn1 ~]$ sudo clustat
Member Status: Quorate

  Member Name    ID   Status
  ------ ----    ---- ------
  pcnm-hb        100  Online
  pcn1-hb        1    Online, Local
  pcn2-hb        2    Offline
  pcn3-hb        3    Online
  pcn4-hb        4    Online

Er again.

--Janne

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster