Re: Necessary a delay to restart cman?

Adam Hough <adam@xxxxxxxxxxxxxxxx> · Wed, 6 May 2009 07:59:05 -0500

On Wed, May 6, 2009 at 7:01 AM, Chrissie Caulfield <ccaulfie@xxxxxxxxxx> wrote:
> Miguel Sanchez wrote:
>> Hi. I have a two-node CentOS 5.3 cluster. If I run service cman restart
>> on one node, or stop and then start it a few seconds later, the other
>> node does not recognize that the member has returned, and its peer stays
>> offline forever.
>>
>> For example:
>>
>> * Before cman restart:
>>
>> node1# cman_tool status
>> Version: 6.1.0
>> Config Version: 6
>> Cluster Name: CSVirtualizacion
>> Cluster Id: 42648
>> Cluster Member: Yes
>> Cluster Generation: 202600
>> Membership state: Cluster-Member
>> Nodes: 2
>> Expected votes: 1
>> Total votes: 2
>> Quorum: 1
>> Active subsystems: 7
>> Flags: 2node Dirty
>> Ports Bound: 0
>> Node name: patty
>> Node ID: 1
>> Multicast addresses: 224.0.0.133
>> Node addresses: 138.100.8.70
>>
>> * After cman stop on node2 (less time elapsed than the token parameter):
>>
>> node1# cman_tool status
>> Version: 6.1.0
>> Config Version: 6
>> Cluster Name: CSVirtualizacion
>> Cluster Id: 42648
>> Cluster Member: Yes
>> Cluster Generation: 202600
>> Membership state: Cluster-Member
>> Nodes: 2
>> Expected votes: 1
>> Total votes: 1
>> Quorum: 1
>> Active subsystems: 7
>> Flags: 2node Dirty
>> Ports Bound: 0
>> Node name: patty
>> Node ID: 1
>> Multicast addresses: 224.0.0.133
>> Node addresses: 138.100.8.70
>> Wed May  6 12:29:38 CEST 2009
>>
>> * After cman stop on node2 (more time elapsed than the token parameter):
>>
>> node1# date; cman_tool status
>> Version: 6.1.0
>> Config Version: 6
>> Cluster Name: CSVirtualizacion
>> Cluster Id: 42648
>> Cluster Member: Yes
>> Cluster Generation: 202604
>> Membership state: Cluster-Member
>> Nodes: 1
>> Expected votes: 1
>> Total votes: 1
>> Quorum: 1
>> Active subsystems: 7
>> Flags: 2node Dirty
>> Ports Bound: 0
>> Node name: patty
>> Node ID: 1
>> Multicast addresses: 224.0.0.133
>> Node addresses: 138.100.8.70
>> Wed May  6 12:29:47 CEST 2009
>>
>> /var/log/messages:
>> May  6 12:35:20 node2 openais[17262]: [TOTEM] The token was lost in the
>> OPERATIONAL state.
>> May  6 12:35:20 node2 openais[17262]: [TOTEM] Receive multicast socket
>> recv buffer size (288000 bytes).
>> May  6 12:35:20 node2 openais[17262]: [TOTEM] Transmit multicast socket
>> send buffer size (262142 bytes).
>> May  6 12:35:20 node2 openais[17262]: [TOTEM] entering GATHER state from 2.
>> May  6 12:35:25 node2 openais[17262]: [TOTEM] entering GATHER state from 0.
>> May  6 12:35:25 node2 openais[17262]: [TOTEM] Creating commit token
>> because I am the rep.
>> May  6 12:35:25 node2 openais[17262]: [TOTEM] Saving state aru 26 high
>> seq received 26
>> May  6 12:35:25 node2 openais[17262]: [TOTEM] Storing new sequence id
>> for ring 31780
>> May  6 12:35:25 node2 openais[17262]: [TOTEM] entering COMMIT state.
>> May  6 12:35:25 node2 openais[17262]: [TOTEM] entering RECOVERY state.
>> May  6 12:35:25 node2 openais[17262]: [TOTEM] position [0] member
>> 10.10.8.70:
>> May  6 12:35:25 node2 openais[17262]: [TOTEM] previous ring seq 202620
>> rep 10.10.8.70
>> May  6 12:35:25 node2 openais[17262]: [TOTEM] aru 26 high delivered 26
>> received flag 1
>> May  6 12:35:25 node2 openais[17262]: [TOTEM] Did not need to originate
>> any messages in recovery.
>> May  6 12:35:25 node2 openais[17262]: [TOTEM] Sending initial ORF token
>> May  6 12:35:25 node2 openais[17262]: [CLM  ] CLM CONFIGURATION CHANGE
>> May  6 12:35:25 node2 openais[17262]: [CLM  ] New Configuration:
>> May  6 12:35:25 node2 openais[17262]: [CLM  ]   r(0) ip(10.10.8.70)
>> May  6 12:35:25 node2 openais[17262]: [CLM  ] Members Left:
>> May  6 12:35:25 node2 openais[17262]: [CLM  ]   r(0) ip(10.10.8.71)
>> May  6 12:35:25 node2 openais[17262]: [CLM  ] Members Joined:
>> May  6 12:35:25 node2 openais[17262]: [CLM  ] CLM CONFIGURATION CHANGE
>> May  6 12:35:25 node2 openais[17262]: [CLM  ] New Configuration:
>> May  6 12:35:25 node2 openais[17262]: [CLM  ]   r(0) ip(10.10.8.70)
>> May  6 12:35:25 node2 openais[17262]: [CLM  ] Members Left:
>> May  6 12:35:25 node2 openais[17262]: [CLM  ] Members Joined:
>> May  6 12:35:25 node2 openais[17262]: [SYNC ] This node is within the
>> primary component and will provide service.
>> May  6 12:35:25 node2 openais[17262]: [TOTEM] entering OPERATIONAL state.
>> May  6 12:35:25 node2 kernel: dlm: closing connection to node 2
>> May  6 12:35:25 node2 openais[17262]: [CLM  ] got nodejoin message
>> 10.10.8.70
>> May  6 12:35:25 node2 openais[17262]: [CPG  ] got joinlist message from
>> node 1
>>
>>
>> If node2 runs cman start without waiting for the loss of the
>> operational token to be detected, node1 sees node2 as offline forever.
>> Subsequent cman restarts do not change this state:
>> node1# cman_tool nodes
>> Node  Sts   Inc   Joined               Name
>>   1   M  202616   2009-05-06 12:34:43  node1
>>   2   X  202628                        node2
>> node2# cman_tool nodes
>> Node  Sts   Inc   Joined               Name
>>   1   M  202644   2009-05-06 12:51:04  node1
>>   2   M  202640   2009-05-06 12:51:04  node2
>>
>>
>> Is a delay between cman stop and cman start necessary to avoid this
>> inconsistent state, or is this really a bug?
>
>
> I suspect it's an instance of this known bug. Check that CentOS has the
> appropriate patch available:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=485026
>
> Chrissie
>
> --
> Linux-cluster mailing list
> Linux-cluster@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

When restarting cman, I have always had to stop cman and then manually
stop openais before trying to start cman again. If I do not follow
these steps, the node never rejoins the cluster, or it may fence the
other node.
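
As a rough sketch, that sequence might look like the script below. It is
not a tested procedure: the process name aisexec for the openais daemon
and the 15-second pause (meant to exceed the token timeout) are
assumptions, so adjust both for your cluster. By default it only prints
the commands it would run.

```shell
#!/bin/sh
# Sketch of a careful cman restart: stop cman, make sure openais is gone,
# wait, then start cman again.
#
# Assumptions (not from this thread):
#   - the openais daemon process is named "aisexec"
#   - 15 seconds is longer than your configured token timeout
# Commands are only printed by default; set RUN= (empty) to execute them.

AIS_PROC="${AIS_PROC:-aisexec}"
WAIT_SECS="${WAIT_SECS:-15}"
RUN="${RUN-echo}"

restart_cman() {
    $RUN service cman stop

    # Stop openais by hand if it outlived "service cman stop".
    if pgrep -x "$AIS_PROC" >/dev/null 2>&1; then
        $RUN pkill -x "$AIS_PROC"
    fi

    # Give the other node time to notice the token loss before rejoining.
    $RUN sleep "$WAIT_SECS"

    $RUN service cman start
}

restart_cman
```

Running it as is prints the planned commands. Running it with RUN= set to
an empty value executes them for real, which should only be done on the
node you intend to restart.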
