I have occasionally run into this problem, too. I have found that I can sometimes work around it by using chkconfig to turn clvmd, cman, and rgmanager off, rebooting, then manually starting cman, rgmanager, and clvmd (in that order). Usually, after that, I am able to fence the node(s) and they rejoin automatically (after re-enabling automatic startup with chkconfig, of course). I know this workaround doesn't explain *why* it happens, but it has more than once helped me get my cluster nodes back online without having to reboot all the nodes.
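For reference, the sequence can be sketched roughly as below. This is a minimal sketch rather than a tested procedure: it assumes the stock init scripts are named cman, rgmanager, and clvmd and are driven with chkconfig and service, and the function names and the DRY_RUN variable are made up here for illustration.

```shell
#!/bin/sh
# Sketch of the workaround: disable automatic startup, reboot, then start
# the cluster services by hand in a fixed order.
# DRY_RUN and all function names are hypothetical helpers, not part of cman.

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1 (before rebooting): turn off automatic startup.
disable_autostart() {
    for svc in clvmd cman rgmanager; do
        run chkconfig "$svc" off
    done
}

# Step 2 (after the reboot): start the services manually, in this order.
# Stops at the first service that fails to start.
start_in_order() {
    for svc in cman rgmanager clvmd; do
        run service "$svc" start || return 1
    done
}

# Step 3 (once things look healthy again): re-enable automatic startup.
enable_autostart() {
    for svc in cman rgmanager clvmd; do
        run chkconfig "$svc" on
    done
}
```

With the default DRY_RUN=1 each function only prints the chkconfig or service commands it would run, so the order can be checked first; setting DRY_RUN=0 runs them for real and needs root.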
On Thu, Jul 31, 2008 at 1:42 PM, Mailing List <ml@xxxxxxxxxxxx> wrote:
Hello,
I currently have a 9-node CentOS 5.1 cman/gfs cluster which I've managed to break.
It is broken in almost exactly the same way as stated in these two previous threads:
http://www.spinics.net/lists/cluster/msg10304.html
http://www.redhat.com/archives/linux-cluster/2008-May/msg00060.html
However, I can find no resolution in the archives. My only guaranteed resolution at this point is a cold restart of all nodes, which seems ridiculous to me (i.e., I'm missing something).
To add a little detail, I have nodes cluster1...9. Nodes 7 & 8 are broken. When I fence/reboot them, cman starts but times out on starting fencing. cman_tool nodes shows them as joined, but the fence domain looks broken.
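One quick way to spot this condition on any node is to filter the group_tool listing for groups whose state is not "none". The snippet below is only a sketch: stuck_groups is a made-up name, and it assumes the five-column layout shown in the output further down and that "none" is what a settled group reports, as the dlm and gfs groups on the good node do.

```shell
#!/bin/sh
# Print "type name state" for every group whose state column is not "none".
# Reads group_tool-style output on stdin; stuck_groups is a hypothetical helper.
stuck_groups() {
    awk '$1 ~ /^(fence|dlm|gfs)$/ && NF >= 5 && $5 != "none" { print $1, $3, $5 }'
}

# Example use on a cluster node:
#   group_tool | stuck_groups
```

An empty result would suggest that no group is waiting on a state change; any line printed names a group to look at more closely.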
Any ideas?
I have included some information for a good node, bad node, and /var/log/messages from a good node that did the fencing.
Good Node:
[root@cluster1 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    768   2008-07-31 12:47:19  cluster1-rhc
   2   M    776   2008-07-31 12:47:37  cluster2-rhc
   3   M    772   2008-07-31 12:47:19  cluster3-rhc
   4   M    788   2008-07-31 12:56:20  cluster4-rhc
   5   M    772   2008-07-31 12:47:19  cluster5-rhc
   6   M    784   2008-07-31 12:52:50  cluster6-rhc
   7   M    808   2008-07-31 13:24:24  cluster7-rhc
   8   X    800                        cluster8-rhc
   9   M    772   2008-07-31 12:47:19  cluster9-rhc
[root@cluster1 ~]# cman_tool services
type   level  name      id        state
fence  0      default   00010003  FAIL_START_WAIT
[1 2 3 4 5 6 9]
dlm    1      testgfs1  00020005  none
[1 2 3 4 5 6]
gfs    2      testgfs1  00010005  none
[1 2 3 4 5 6]
[root@cluster1 ~]# cman_tool status
Version: 6.1.0
Config Version: 13
Cluster Name: test
Cluster Id: 1678
Cluster Member: Yes
Cluster Generation: 808
Membership state: Cluster-Member
Nodes: 8
Expected votes: 9
Total votes: 8
Quorum: 5
Active subsystems: 7
Flags: Dirty
Ports Bound: 0
Node name: cluster1-rhc
Node ID: 1
Multicast addresses: 239.192.6.148
Node addresses: 10.128.161.81
[root@cluster1 ~]# group_tool
type   level  name      id        state
fence  0      default   00010003  FAIL_START_WAIT
[1 2 3 4 5 6 9]
dlm    1      testgfs1  00020005  none
[1 2 3 4 5 6]
gfs    2      testgfs1  00010005  none
[1 2 3 4 5 6]
[root@cluster1 ~]#
Bad/broken Node:
[root@cluster7 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    808   2008-07-31 13:24:24  cluster1-rhc
   2   M    808   2008-07-31 13:24:24  cluster2-rhc
   3   M    808   2008-07-31 13:24:24  cluster3-rhc
   4   M    808   2008-07-31 13:24:24  cluster4-rhc
   5   M    808   2008-07-31 13:24:24  cluster5-rhc
   6   M    808   2008-07-31 13:24:24  cluster6-rhc
   7   M    804   2008-07-31 13:24:24  cluster7-rhc
   8   X      0                        cluster8-rhc
   9   M    808   2008-07-31 13:24:24  cluster9-rhc
[root@cluster7 ~]# cman_tool services
type   level  name     id        state
fence  0      default  00000000  JOIN_STOP_WAIT
[1 2 3 4 5 6 7 9]
[root@cluster7 ~]# cman_tool status
Version: 6.1.0
Config Version: 13
Cluster Name: test
Cluster Id: 1678
Cluster Member: Yes
Cluster Generation: 808
Membership state: Cluster-Member
Nodes: 8
Expected votes: 9
Total votes: 8
Quorum: 5
Active subsystems: 7
Flags: Dirty
Ports Bound: 0
Node name: cluster7-rhc
Node ID: 7
Multicast addresses: 239.192.6.148
Node addresses: 10.128.161.87
[root@cluster7 ~]# group_tool
type   level  name     id        state
fence  0      default  00000000  JOIN_STOP_WAIT
[1 2 3 4 5 6 7 9]
[root@cluster7 ~]#
/var/log/messages:
Jul 31 13:20:54 cluster3 fence_node[3813]: Fence of "cluster7-rhc" was successful
Jul 31 13:21:03 cluster3 fence_node[3815]: Fence of "cluster8-rhc" was successful
Jul 31 13:21:11 cluster3 openais[3084]: [TOTEM] entering GATHER state from 12.
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering GATHER state from 11.
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Saving state aru 89 high seq received 89
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Storing new sequence id for ring 324
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering COMMIT state.
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering RECOVERY state.
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [0] member 10.128.161.81:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [1] member 10.128.161.82:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [2] member 10.128.161.83:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 kernel: dlm: closing connection to node 7
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 kernel: dlm: closing connection to node 8
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [3] member 10.128.161.84:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [4] member 10.128.161.85:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [5] member 10.128.161.86:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [6] member 10.128.161.89:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Did not need to originate any messages in recovery.
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] CLM CONFIGURATION CHANGE
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] New Configuration:
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.81)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.82)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.83)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.84)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.85)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.86)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.89)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] Members Left:
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.87)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.88)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] Members Joined:
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] CLM CONFIGURATION CHANGE
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] New Configuration:
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.81)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.82)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.83)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.84)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.85)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.86)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.89)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] Members Left:
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] Members Joined:
Jul 31 13:21:16 cluster3 openais[3084]: [SYNC ] This node is within the primary component and will provide service.
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering OPERATIONAL state.
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.82
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.83
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.84
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.85
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.86
Jul 31 13:21:16 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.89
Jul 31 13:21:16 cluster3 openais[3084]: [CPG ] got joinlist message from node 2
Jul 31 13:21:16 cluster3 openais[3084]: [CPG ] got joinlist message from node 3
Jul 31 13:21:16 cluster3 openais[3084]: [CPG ] got joinlist message from node 4
Jul 31 13:21:16 cluster3 openais[3084]: [CPG ] got joinlist message from node 5
Jul 31 13:21:16 cluster3 openais[3084]: [CPG ] got joinlist message from node 6
Jul 31 13:21:16 cluster3 openais[3084]: [CPG ] got joinlist message from node 9
Jul 31 13:21:16 cluster3 openais[3084]: [CPG ] got joinlist message from node 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering GATHER state from 11.
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Saving state aru 68 high seq received 68
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Storing new sequence id for ring 328
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering COMMIT state.
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering RECOVERY state.
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [0] member 10.128.161.81:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [1] member 10.128.161.82:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [2] member 10.128.161.83:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [3] member 10.128.161.84:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [4] member 10.128.161.85:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [5] member 10.128.161.86:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [6] member 10.128.161.87:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.87
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 9 high delivered 9 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [7] member 10.128.161.89:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Did not need to originate any messages in recovery.
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] CLM CONFIGURATION CHANGE
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] New Configuration:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.81)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.82)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.83)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.84)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.85)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.86)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.89)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] Members Left:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] Members Joined:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] CLM CONFIGURATION CHANGE
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] New Configuration:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.81)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.82)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.83)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.84)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.85)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.86)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.87)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.89)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] Members Left:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] Members Joined:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] r(0) ip(10.128.161.87)
Jul 31 13:24:24 cluster3 openais[3084]: [SYNC ] This node is within the primary component and will provide service.
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering OPERATIONAL state.
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.82
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.83
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.84
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.85
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.86
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.87
Jul 31 13:24:24 cluster3 openais[3084]: [CLM ] got nodejoin message 10.128.161.89
Jul 31 13:24:24 cluster3 openais[3084]: [CPG ] got joinlist message from node 6
Jul 31 13:24:24 cluster3 openais[3084]: [CPG ] got joinlist message from node 9
Jul 31 13:24:24 cluster3 openais[3084]: [CPG ] got joinlist message from node 1
Jul 31 13:24:24 cluster3 openais[3084]: [CPG ] got joinlist message from node 2
Jul 31 13:24:24 cluster3 openais[3084]: [CPG ] got joinlist message from node 3
Jul 31 13:24:24 cluster3 openais[3084]: [CPG ] got joinlist message from node 4
Jul 31 13:24:24 cluster3 openais[3084]: [CPG ] got joinlist message from node 5
Thanks!
Adam
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster