Some nodes won't join after being fenced

Hello,

I currently have a 9-node CentOS 5.1 cman/GFS cluster, which I've managed to break.

It is broken in almost exactly the same way as stated in these two previous threads:

http://www.spinics.net/lists/cluster/msg10304.html
http://www.redhat.com/archives/linux-cluster/2008-May/msg00060.html

However, I can find no resolution in the archives. My only guaranteed fix at this point is a cold restart of all nodes, which seems ridiculous to me (i.e., I'm missing something).

To add a little detail: I have nodes cluster1 through cluster9, and nodes 7 and 8 are broken. When I fence/reboot them, cman starts but times out while starting fencing. cman_tool nodes shows them as joined, but the fence domain looks broken.

Any ideas?

I have included some information for a good node, bad node, and /var/log/messages from a good node that did the fencing.

Good Node:

[root@cluster1 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    768   2008-07-31 12:47:19  cluster1-rhc
   2   M    776   2008-07-31 12:47:37  cluster2-rhc
   3   M    772   2008-07-31 12:47:19  cluster3-rhc
   4   M    788   2008-07-31 12:56:20  cluster4-rhc
   5   M    772   2008-07-31 12:47:19  cluster5-rhc
   6   M    784   2008-07-31 12:52:50  cluster6-rhc
   7   M    808   2008-07-31 13:24:24  cluster7-rhc
   8   X    800                        cluster8-rhc
   9   M    772   2008-07-31 12:47:19  cluster9-rhc
[root@cluster1 ~]# cman_tool services
type             level name      id       state
fence            0     default   00010003 FAIL_START_WAIT
[1 2 3 4 5 6 9]
dlm              1     testgfs1  00020005 none
[1 2 3 4 5 6]
gfs              2     testgfs1  00010005 none
[1 2 3 4 5 6]
[root@cluster1 ~]# cman_tool status
Version: 6.1.0
Config Version: 13
Cluster Name: test
Cluster Id: 1678
Cluster Member: Yes
Cluster Generation: 808
Membership state: Cluster-Member
Nodes: 8
Expected votes: 9
Total votes: 8
Quorum: 5
Active subsystems: 7
Flags: Dirty
Ports Bound: 0
Node name: cluster1-rhc
Node ID: 1
Multicast addresses: 239.192.6.148
Node addresses: 10.128.161.81
[root@cluster1 ~]# group_tool
type             level name      id       state
fence            0     default   00010003 FAIL_START_WAIT
[1 2 3 4 5 6 9]
dlm              1     testgfs1  00020005 none
[1 2 3 4 5 6]
gfs              2     testgfs1  00010005 none
[1 2 3 4 5 6]
[root@cluster1 ~]#
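
As a sanity check on the status output above: the cluster still looks quorate, so I don't think loss of quorum explains the stuck fence domain. A minimal sketch of the arithmetic, assuming the threshold is a simple majority of the expected votes (the function name is made up for illustration):

```python
def quorum_threshold(expected_votes):
    """Simple-majority threshold: more than half of the expected votes.

    Assumption: quorum = floor(expected_votes / 2) + 1, which matches
    the "Quorum: 5" reported for "Expected votes: 9" above.
    """
    return expected_votes // 2 + 1


expected_votes = 9  # "Expected votes" from cman_tool status
total_votes = 8     # "Total votes" from cman_tool status

threshold = quorum_threshold(expected_votes)
quorate = total_votes >= threshold
print("quorum threshold:", threshold)
print("quorate:", quorate)
```

With 8 of 9 votes present the threshold of 5 is comfortably met, which is consistent with "Membership state: Cluster-Member" on both nodes.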


Bad/broken Node:

[root@cluster7 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    808   2008-07-31 13:24:24  cluster1-rhc
   2   M    808   2008-07-31 13:24:24  cluster2-rhc
   3   M    808   2008-07-31 13:24:24  cluster3-rhc
   4   M    808   2008-07-31 13:24:24  cluster4-rhc
   5   M    808   2008-07-31 13:24:24  cluster5-rhc
   6   M    808   2008-07-31 13:24:24  cluster6-rhc
   7   M    804   2008-07-31 13:24:24  cluster7-rhc
   8   X      0                        cluster8-rhc
   9   M    808   2008-07-31 13:24:24  cluster9-rhc
[root@cluster7 ~]# cman_tool services
type             level name     id       state
fence            0     default  00000000 JOIN_STOP_WAIT
[1 2 3 4 5 6 7 9]
[root@cluster7 ~]# cman_tool status
Version: 6.1.0
Config Version: 13
Cluster Name: test
Cluster Id: 1678
Cluster Member: Yes
Cluster Generation: 808
Membership state: Cluster-Member
Nodes: 8
Expected votes: 9
Total votes: 8
Quorum: 5
Active subsystems: 7
Flags: Dirty
Ports Bound: 0
Node name: cluster7-rhc
Node ID: 7
Multicast addresses: 239.192.6.148
Node addresses: 10.128.161.87
[root@cluster7 ~]# group_tool
type             level name     id       state
fence            0     default  00000000 JOIN_STOP_WAIT
[1 2 3 4 5 6 7 9]
[root@cluster7 ~]#
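
The mismatch I'm describing is easier to see when the two listings are compared mechanically. Below is a rough sketch, not a supported tool: it parses the text of `cman_tool nodes` and `cman_tool services` as pasted above from cluster1 and reports nodes that are cluster members (Sts "M") but missing from the fence group. The function names and the assumed column layout are my own and are based only on the output shown in this mail.

```python
import re


def parse_members(nodes_output):
    """Return IDs of nodes whose Sts column is 'M' in `cman_tool nodes` text."""
    members = set()
    for line in nodes_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].isdigit() and fields[1] == "M":
            members.add(int(fields[0]))
    return members


def parse_groups(services_output):
    """Parse `cman_tool services` text into a list of group dicts."""
    groups = []
    for line in services_output.splitlines():
        line = line.strip()
        # A bracketed list of node IDs belongs to the group line before it.
        match = re.fullmatch(r"\[([\d\s]*)\]", line)
        if match and groups:
            groups[-1]["members"] = {int(n) for n in match.group(1).split()}
            continue
        fields = line.split()
        # Group lines have five columns: type, level, name, id, state.
        if len(fields) == 5 and fields[1].isdigit():
            groups.append({"type": fields[0], "name": fields[2],
                           "state": fields[4], "members": set()})
    return groups


# Output copied from cluster1 (the good node) above.
NODES = """\
Node  Sts   Inc   Joined               Name
   1   M    768   2008-07-31 12:47:19  cluster1-rhc
   2   M    776   2008-07-31 12:47:37  cluster2-rhc
   3   M    772   2008-07-31 12:47:19  cluster3-rhc
   4   M    788   2008-07-31 12:56:20  cluster4-rhc
   5   M    772   2008-07-31 12:47:19  cluster5-rhc
   6   M    784   2008-07-31 12:52:50  cluster6-rhc
   7   M    808   2008-07-31 13:24:24  cluster7-rhc
   8   X    800                        cluster8-rhc
   9   M    772   2008-07-31 12:47:19  cluster9-rhc
"""

SERVICES = """\
type             level name      id       state
fence            0     default   00010003 FAIL_START_WAIT
[1 2 3 4 5 6 9]
dlm              1     testgfs1  00020005 none
[1 2 3 4 5 6]
gfs              2     testgfs1  00010005 none
[1 2 3 4 5 6]
"""

members = parse_members(NODES)
groups = parse_groups(SERVICES)
fence = next(g for g in groups if g["type"] == "fence")
missing_from_fence = sorted(members - fence["members"])
stuck_groups = [(g["type"], g["state"]) for g in groups if g["state"] != "none"]

print("cluster members:", sorted(members))
print("members missing from fence domain:", missing_from_fence)
print("groups not in state 'none':", stuck_groups)
```

On the cluster1 output this flags node 7 as a cluster member that is absent from the fence group, and shows the fence group as the only one not in state "none".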


/var/log/messages:

Jul 31 13:20:54 cluster3 fence_node[3813]: Fence of "cluster7-rhc" was successful
Jul 31 13:21:03 cluster3 fence_node[3815]: Fence of "cluster8-rhc" was successful
Jul 31 13:21:11 cluster3 openais[3084]: [TOTEM] entering GATHER state from 12.
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering GATHER state from 11.
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Saving state aru 89 high seq received 89
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Storing new sequence id for ring 324
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering COMMIT state.
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering RECOVERY state.
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [0] member 10.128.161.81:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [1] member 10.128.161.82:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [2] member 10.128.161.83:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 kernel: dlm: closing connection to node 7
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 kernel: dlm: closing connection to node 8
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [3] member 10.128.161.84:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [4] member 10.128.161.85:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [5] member 10.128.161.86:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [6] member 10.128.161.89:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Did not need to originate any messages in recovery.
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] New Configuration:
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.81)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.82)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.83)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.84)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.85)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.86)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.89)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Left:
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.87)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.88)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Joined:
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] New Configuration:
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.81)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.82)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.83)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.84)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.85)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.86)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.89)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Left:
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Joined:
Jul 31 13:21:16 cluster3 openais[3084]: [SYNC ] This node is within the primary component and will provide service.
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering OPERATIONAL state.
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.82
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.83
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.84
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.85
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.86
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.89
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from node 2
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from node 3
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from node 4
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from node 5
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from node 6
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from node 9
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from node 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering GATHER state from 11.
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Saving state aru 68 high seq received 68
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Storing new sequence id for ring 328
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering COMMIT state.
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering RECOVERY state.
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [0] member 10.128.161.81:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [1] member 10.128.161.82:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [2] member 10.128.161.83:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [3] member 10.128.161.84:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [4] member 10.128.161.85:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [5] member 10.128.161.86:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [6] member 10.128.161.87:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.87
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 9 high delivered 9 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [7] member 10.128.161.89:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Did not need to originate any messages in recovery.
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] New Configuration:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.81)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.82)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.83)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.84)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.85)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.86)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.89)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Left:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Joined:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] New Configuration:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.81)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.82)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.83)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.84)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.85)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.86)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.87)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.89)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Left:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Joined:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] r(0) ip(10.128.161.87)
Jul 31 13:24:24 cluster3 openais[3084]: [SYNC ] This node is within the primary component and will provide service.
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering OPERATIONAL state.
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.82
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.83
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.84
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.85
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.86
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.87
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.89
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from node 6
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from node 9
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from node 1
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from node 2
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from node 3
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from node 4
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from node 5

Thanks!

Adam

