On 24/02/14 08:39, Bjoern Teipel wrote:
Hi Fabio,

removing UDPU does not change the behavior, the new node still doesn't join the cluster and still wants to fence node 01. It still feels like a split brain or so.
How do you join a new node, using /etc/init.d/cman start or using cman_tool / dlm_tool join?

Bjoern

On Sat, Feb 22, 2014 at 10:16 PM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:

On 02/22/2014 08:05 PM, Bjoern Teipel wrote:
> Thanks Fabio for replying to my request.
>
> I'm using stock CentOS 6.4 versions and no rm, just clvmd and dlm.
>
> Name        : cman             Relocations: (not relocatable)
> Version     : 3.0.12.1         Vendor: CentOS
> Release     : 49.el6_4.2       Build Date: Tue 03 Sep 2013 02:18:10 AM PDT
>
> Name        : lvm2-cluster     Relocations: (not relocatable)
> Version     : 2.02.98          Vendor: CentOS
> Release     : 9.el6_4.3        Build Date: Tue 05 Nov 2013 07:36:18 AM PST
>
> Name        : corosync         Relocations: (not relocatable)
> Version     : 1.4.1            Vendor: CentOS
> Release     : 15.el6_4.1       Build Date: Tue 14 May 2013 02:09:27 PM PDT
>
>
> My question is based off this problem I have till January:
>
>
> When ever I add a new node (I put into the cluster.conf and reloaded
> with cman_tool version -r -S) I end up with situations like the new
> node wants to gain the quorum and starts to fence the existing pool
> master and appears to generate some sort of split cluster. Does it work
> at all, corosync and dlm do not know about the recently added node ?

I can see you are using UDPU and that could be the culprit. Can you drop UDPU and work with multicast?

Jan/Chrissie: do you remember if we support adding nodes at runtime with UDPU?

The standalone node should not have quorum at all and should not be able to fence anybody to start with.

>
> New Node
> ==========
>
> Node  Sts   Inc   Joined               Name
>    1   X      0                        hv-1
>    2   X      0                        hv-2
>    3   X      0                        hv-3
>    4   X      0                        hv-4
>    5   X      0                        hv-5
>    6   M     80   2014-01-07 21:37:42  hv-6   <--- host added
>
>
> Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] The network interface [10.14.18.77] is now up.
> Jan 7 21:37:42 hv-1 corosync[12564]: [QUORUM] Using quorum provider quorum_cman
> Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
> Jan 7 21:37:42 hv-1 corosync[12564]: [CMAN ] CMAN 3.0.12.1 (built Sep 3 2013 09:17:34) started
> Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync CMAN membership service 2.90
> Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: openais checkpoint service B.01.01
> Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
> Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync configuration service
> Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
> Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
> Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync profile loading service
> Jan 7 21:37:42 hv-1 corosync[12564]: [QUORUM] Using quorum provider quorum_cman
> Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
> Jan 7 21:37:42 hv-1 corosync[12564]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
> Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.65}
> Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.67}
> Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.68}
> Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.70}
> Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.66}
> Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.77}
> Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Jan 7 21:37:42 hv-1 corosync[12564]: [CMAN ] quorum regained, resuming activity
> Jan 7 21:37:42 hv-1 corosync[12564]: [QUORUM] This node is within the primary component and will provide service.
> Jan 7 21:37:42 hv-1 corosync[12564]: [QUORUM] Members[1]: 6
> Jan 7 21:37:42 hv-1 corosync[12564]: [QUORUM] Members[1]: 6
> Jan 7 21:37:42 hv-1 corosync[12564]: [CPG ] chosen downlist: sender r(0) ip(10.14.18.77) ; members(old:0 left:0)
> Jan 7 21:37:42 hv-1 corosync[12564]: [MAIN ] Completed service synchronization, ready to provide service.
> Jan 7 21:37:46 hv-1 fenced[12620]: fenced 3.0.12.1 started
> Jan 7 21:37:46 hv-1 dlm_controld[12643]: dlm_controld 3.0.12.1 started
> Jan 7 21:37:47 hv-1 gfs_controld[12695]: gfs_controld 3.0.12.1 started
> Jan 7 21:37:54 hv-1 fenced[12620]: fencing node hv-b1clcy1
>
> sudo -i corosync-objctl |grep member
>
> totem.interface.member.memberaddr=hv-1
> totem.interface.member.memberaddr=hv-2
> totem.interface.member.memberaddr=hv-3
> totem.interface.member.memberaddr=hv-4
> totem.interface.member.memberaddr=hv-5
> totem.interface.member.memberaddr=hv-6
> runtime.totem.pg.mrp.srp.members.6.ip=r(0) ip(10.14.18.77)
> runtime.totem.pg.mrp.srp.members.6.join_count=1
> runtime.totem.pg.mrp.srp.members.6.status=joined
>
>
> Existing Node
> =============
>
> member 6 has not been added to the quorum list :
>
> Jan 7 21:36:28 hv-1 corosync[7769]: [QUORUM] Members[4]: 1 2 3 5
> Jan 7 21:37:54 hv-1 corosync[7769]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Jan 7 21:37:54 hv-1 corosync[7769]: [CPG ] chosen downlist: sender r(0) ip(10.14.18.65) ; members(old:4 left:0)
>
>
> Node  Sts   Inc   Joined               Name
>    1   M   4468   2013-12-10 14:33:27  hv-1
>    2   M   4468   2013-12-10 14:33:27  hv-2
>    3   M   5036   2014-01-07 17:51:26  hv-3
>    4   X   4468                        hv-4   (dead at the moment)
>    5   M   4468   2013-12-10 14:33:27  hv-5
>    6   X      0                        hv-6   <--- added
>
>
> Jan 7 21:36:28 hv-1 corosync[7769]: [QUORUM] Members[4]: 1 2 3 5
> Jan 7 21:37:54 hv-1 corosync[7769]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Jan 7 21:37:54 hv-1 corosync[7769]: [CPG ] chosen downlist: sender r(0) ip(10.14.18.65) ; members(old:4 left:0)
> Jan 7 21:37:54 hv-1 corosync[7769]: [MAIN ] Completed service synchronization, ready to provide service.
>
>
> totem.interface.member.memberaddr=hv-1
> totem.interface.member.memberaddr=hv-2
> totem.interface.member.memberaddr=hv-3
> totem.interface.member.memberaddr=hv-4
> totem.interface.member.memberaddr=hv-5
> runtime.totem.pg.mrp.srp.members.1.ip=r(0) ip(10.14.18.65)
> runtime.totem.pg.mrp.srp.members.1.join_count=1
> runtime.totem.pg.mrp.srp.members.1.status=joined
> runtime.totem.pg.mrp.srp.members.2.ip=r(0) ip(10.14.18.66)
> runtime.totem.pg.mrp.srp.members.2.join_count=1
> runtime.totem.pg.mrp.srp.members.2.status=joined
> runtime.totem.pg.mrp.srp.members.4.ip=r(0) ip(10.14.18.68)
> runtime.totem.pg.mrp.srp.members.4.join_count=1
> runtime.totem.pg.mrp.srp.members.4.status=left
> runtime.totem.pg.mrp.srp.members.5.ip=r(0) ip(10.14.18.70)
> runtime.totem.pg.mrp.srp.members.5.join_count=1
> runtime.totem.pg.mrp.srp.members.5.status=joined
> runtime.totem.pg.mrp.srp.members.3.ip=r(0) ip(10.14.18.67)
> runtime.totem.pg.mrp.srp.members.3.join_count=3
> runtime.totem.pg.mrp.srp.members.3.status=joined
>
>
> cluster.conf:
>
> <?xml version="1.0"?>
> <cluster config_version="32" name="hv-1618-110-1">
> <fence_daemon clean_start="0"/>
> <cman transport="udpu" expected_votes="1"/>

Setting expected_votes to 1 in a six-node cluster is a serious configuration error and needs to be changed. It lets a single node consider itself quorate on its own, and that is what is causing the new node to fence the rest of the cluster.

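To make the arithmetic concrete, here is a small sketch of the majority rule. It assumes quorum is computed as expected_votes / 2 + 1 with integer division and that every node carries one vote; the function names are made up for illustration and are not part of cman.

```python
def quorum_threshold(expected_votes):
    """Votes needed for quorum under the assumed majority rule."""
    if expected_votes < 1:
        raise ValueError("expected_votes must be at least 1")
    # Integer division: more than half of the expected votes are required.
    return expected_votes // 2 + 1


def is_quorate(votes_present, expected_votes):
    """True if the votes currently visible reach the quorum threshold."""
    return votes_present >= quorum_threshold(expected_votes)


if __name__ == "__main__":
    # Compare the posted setting (1) with the six-node value (6),
    # for a freshly started node that can only see its own vote.
    for expected in (1, 6):
        print(
            "expected_votes=%d: quorum=%d, lone node quorate=%s"
            % (expected, quorum_threshold(expected), is_quorate(1, expected))
        )
```

With expected_votes=1 the threshold is a single vote, so a node that sees nobody else still counts as quorate. With expected_votes=6 it would need four votes before it could act.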
Check that all of the nodes have the same cluster.conf file; any difference between the file on the existing nodes and the one on the new node will also prevent the new node from joining.

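One rough way to do that check, once you have collected the text of each node's cluster.conf by whatever means you prefer, is to compare config_version and a checksum of the whole file. This is only a sketch: the node names and the sample XML below are hypothetical, and the helper functions are not part of any cluster tool.

```python
import hashlib
import xml.etree.ElementTree as ET


def summarize(conf_text):
    """Return (config_version, sha256 digest) for one cluster.conf text."""
    root = ET.fromstring(conf_text)
    version = root.get("config_version", "")
    digest = hashlib.sha256(conf_text.encode("utf-8")).hexdigest()
    return version, digest


def find_mismatches(confs, reference):
    """Names of nodes whose file differs from the reference node's file."""
    ref_summary = summarize(confs[reference])
    return sorted(
        node for node, text in confs.items() if summarize(text) != ref_summary
    )


if __name__ == "__main__":
    # Hypothetical sample data standing in for files fetched from each node.
    current = (
        '<?xml version="1.0"?>\n'
        '<cluster config_version="33" name="demo">\n'
        '  <cman expected_votes="6"/>\n'
        "</cluster>\n"
    )
    stale = current.replace('config_version="33"', 'config_version="32"')
    confs = {"node-a": current, "node-b": current, "node-c": stale}
    print(find_mismatches(confs, "node-a"))
```

Any node reported here has a file that is not byte-for-byte identical to the reference, so it is worth fixing before trying to join again.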
Chrissie

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster