I still think you should check the multicast setup or maybe use UDPU or
broadcast, if only to eliminate the possibility. I've seen this sort of
thing happen when snooping is switched off for example. Multicast
packets do 'flow', but the switch doesn't allow new nodes to join the
multicast group - or at least not quickly enough for the cluster
protocol. It's a classic symptom: join 3 nodes at the same time, then
add another later and it can't get in. In fact, I can easily reproduce it.
It might not be that, but it's best to check, because it's the most
common cause of this symptom IME.
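
A quick way to prove it one way or the other is omping, run on all four
nodes at the same time (just a sketch; the group address is the one cman
reports below, adjust the node names to taste):

    omping -m 239.192.45.137 node01 node02 node03 node04

If the three established nodes see each other's multicast packets but the
new one only ever gets unicast replies, the switch/IGMP side is the problem.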
The (highly) recommended thing to do with expected_votes is to leave it
out of cluster.conf altogether and let cman calculate it. That avoids
any nasty accidents like the last one ;-)
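
In cluster.conf terms that just means a bare cman element, e.g. (a sketch):

    <cman/>   <!-- no expected_votes: cman derives it from the votes of the
                   nodes listed in <clusternodes> -->

plus a config_version bump and a 'cman_tool version -r' to push it out.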
Chrissie
On 24/02/14 19:45, Bjoern Teipel wrote:
Thanks Chrissie,
that was an old artifact from testing with two nodes.
I set expected_votes to 4 now (3 existing nodes in the cluster plus one
new one), but I still have the same issue.
It seems like the new node can't gain quorum over corosync; I see
multicast packets flowing over the wire, but the quorum membership seems
to be static:
Feb 24 11:29:09 corosync [QUORUM] Members[3]: 1 2 3
Version: 6.2.0
Config Version: 4
Cluster Name: hv-1618-106-1
Cluster Id: 11612
Cluster Member: Yes
Cluster Generation: 244
Membership state: Cluster-Member
Nodes: 3
Expected votes: 4
Total votes: 3
Node votes: 1
Quorum: 3
Active subsystems: 8
Flags:
Ports Bound: 0 11
Node name: node01
Node ID: 1
Multicast addresses: 239.192.45.137
Node addresses: 10.14.10.6
On Node04:
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... Timed-out waiting for cluster
[FAILED]
Stopping cluster:
Leaving fence domain... [ OK ]
Stopping gfs_controld... [ OK ]
Stopping dlm_controld... [ OK ]
Stopping fenced... [ OK ]
Stopping cman... [ OK ]
Waiting for corosync to shutdown: [ OK ]
Unloading kernel modules... [ OK ]
Unmounting configfs... [ OK ]
Node status:
Node Sts Inc Joined Name
1 M 236 2014-02-24 00:22:32 node01
2 M 240 2014-02-24 00:22:34 node02
3 M 244 2014-02-24 00:22:38 node03
4 X 0 node04
On Mon, Feb 24, 2014 at 2:25 AM, Christine Caulfield
<ccaulfie@xxxxxxxxxx> wrote:
On 24/02/14 08:39, Bjoern Teipel wrote:
Hi Fabio,
removing UDPU does not change the behavior: the new node still doesn't
join the cluster and still wants to fence node 01.
It still feels like a split brain or so.
How do you join a new node: using /etc/init.d/cman start, or using
cman_tool / dlm_tool join?
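
For reference, the two options mentioned look roughly like this (a sketch
only, not necessarily what is in use here):

    service cman start       # init script: loads modules, starts corosync/cman,
                             # then waits for quorum
    cman_tool join -w        # or by hand; -w waits until the join has completed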
Bjoern
On Sat, Feb 22, 2014 at 10:16 PM, Fabio M. Di Nitto
<fdinitto@xxxxxxxxxx> wrote:
On 02/22/2014 08:05 PM, Bjoern Teipel wrote:
> Thanks Fabio for replying to my request.
>
> I'm using stock CentOS 6.4 versions and no rm, just clvmd and dlm.
>
> Name        : cman             Relocations: (not relocatable)
> Version     : 3.0.12.1         Vendor: CentOS
> Release     : 49.el6_4.2       Build Date: Tue 03 Sep 2013 02:18:10 AM PDT
> 
> Name        : lvm2-cluster     Relocations: (not relocatable)
> Version     : 2.02.98          Vendor: CentOS
> Release     : 9.el6_4.3        Build Date: Tue 05 Nov 2013 07:36:18 AM PST
> 
> Name        : corosync         Relocations: (not relocatable)
> Version     : 1.4.1            Vendor: CentOS
> Release     : 15.el6_4.1       Build Date: Tue 14 May 2013 02:09:27 PM PDT
>
>
> My question is based on this problem I have had since January:
>
>
> Whenever I add a new node (I put it into cluster.conf and reloaded with
> cman_tool version -r -S) I end up with situations where the new node
> wants to gain quorum and starts to fence the existing pool master, and
> appears to generate some sort of split cluster. Does this work at all
> when corosync and dlm do not know about the recently added node?
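
For context, the add-at-runtime sequence described above boils down to
something like this (a sketch; with -S the updated cluster.conf has to be
copied to the new node by hand):

    # on an existing member, after adding the <clusternode> entry and
    # bumping config_version in /etc/cluster/cluster.conf:
    cman_tool version -r -S    # reload the config; -S skips the ccs_sync push
    # then on the new node, with the same cluster.conf in place:
    service cman start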
I can see you are using UDPU and that could be the culprit. Can you drop
UDPU and work with multicast?

Jan/Chrissie: do you remember if we support adding nodes at runtime with
UDPU?

The standalone node should not have quorum at all and should not be able
to fence anybody to start with.
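
If multicast is brought back, the cman line would simply lose the udpu
transport, e.g. (a sketch; cman picks a default 239.192.x.y group derived
from the cluster ID unless one is pinned with a <multicast addr=".."/> child):

    <cman/>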
>
> New Node
> ==========
>
> Node Sts Inc Joined Name
> 1 X 0 hv-1
> 2 X 0 hv-2
> 3 X 0 hv-3
> 4 X 0 hv-4
> 5 X 0 hv-5
> 6 M 80 2014-01-07 21:37:42 hv-6   <--- host added
>
>
> Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] The network interface [10.14.18.77] is now up.
> Jan 7 21:37:42 hv-1 corosync[12564]: [QUORUM] Using quorum provider quorum_cman
> Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
> Jan 7 21:37:42 hv-1 corosync[12564]: [CMAN ] CMAN 3.0.12.1 (built Sep 3 2013 09:17:34) started
> Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync CMAN membership service 2.90
> Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: openais checkpoint service B.01.01
> Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
> Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync configuration service
> Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
> Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
> Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync profile loading service
> Jan 7 21:37:42 hv-1 corosync[12564]: [QUORUM] Using quorum provider quorum_cman
> Jan 7 21:37:42 hv-1 corosync[12564]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
> Jan 7 21:37:42 hv-1 corosync[12564]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
> Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.65}
> Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.67}
> Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.68}
> Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.70}
> Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.66}
> Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] adding new UDPU member {10.14.18.77}
> Jan 7 21:37:42 hv-1 corosync[12564]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Jan 7 21:37:42 hv-1 corosync[12564]: [CMAN ] quorum regained, resuming activity
> Jan 7 21:37:42 hv-1 corosync[12564]: [QUORUM] This node is within the primary component and will provide service.
> Jan 7 21:37:42 hv-1 corosync[12564]: [QUORUM] Members[1]: 6
> Jan 7 21:37:42 hv-1 corosync[12564]: [QUORUM] Members[1]: 6
> Jan 7 21:37:42 hv-1 corosync[12564]: [CPG ] chosen downlist: sender r(0) ip(10.14.18.77) ; members(old:0 left:0)
> Jan 7 21:37:42 hv-1 corosync[12564]: [MAIN ] Completed service synchronization, ready to provide service.
> Jan 7 21:37:46 hv-1 fenced[12620]: fenced 3.0.12.1 started
> Jan 7 21:37:46 hv-1 dlm_controld[12643]: dlm_controld 3.0.12.1 started
> Jan 7 21:37:47 hv-1 gfs_controld[12695]: gfs_controld 3.0.12.1 started
> Jan 7 21:37:54 hv-1 fenced[12620]: fencing node hv-b1clcy1
>
> sudo -i corosync-objctl |grep member
>
> totem.interface.member.memberaddr=hv-1
> totem.interface.member.memberaddr=hv-2
> totem.interface.member.memberaddr=hv-3
> totem.interface.member.memberaddr=hv-4
> totem.interface.member.memberaddr=hv-5
> totem.interface.member.memberaddr=hv-6
> runtime.totem.pg.mrp.srp.members.6.ip=r(0) ip(10.14.18.77)
> runtime.totem.pg.mrp.srp.members.6.join_count=1
> runtime.totem.pg.mrp.srp.members.6.status=joined
>
>
> Existing Node
> =============
>
> member 6 has not been added to the quorum list:
>
> Jan 7 21:36:28 hv-1 corosync[7769]: [QUORUM] Members[4]: 1 2 3 5
> Jan 7 21:37:54 hv-1 corosync[7769]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Jan 7 21:37:54 hv-1 corosync[7769]: [CPG ] chosen downlist: sender r(0) ip(10.14.18.65) ; members(old:4 left:0)
>
>
> Node Sts Inc Joined Name
> 1 M 4468 2013-12-10 14:33:27 hv-1
> 2 M 4468 2013-12-10 14:33:27 hv-2
> 3 M 5036 2014-01-07 17:51:26 hv-3
> 4 X 4468 hv-4   (dead at the moment)
> 5 M 4468 2013-12-10 14:33:27 hv-5
> 6 X 0 hv-6   <--- added
>
>
> Jan 7 21:36:28 hv-1 corosync[7769]: [QUORUM] Members[4]: 1 2 3 5
> Jan 7 21:37:54 hv-1 corosync[7769]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Jan 7 21:37:54 hv-1 corosync[7769]: [CPG ] chosen downlist: sender r(0) ip(10.14.18.65) ; members(old:4 left:0)
> Jan 7 21:37:54 hv-1 corosync[7769]: [MAIN ] Completed service synchronization, ready to provide service.
>
>
> totem.interface.member.memberaddr=hv-1
> totem.interface.member.memberaddr=hv-2
> totem.interface.member.memberaddr=hv-3
> totem.interface.member.memberaddr=hv-4
> totem.interface.member.memberaddr=hv-5
> runtime.totem.pg.mrp.srp.members.1.ip=r(0) ip(10.14.18.65)
> runtime.totem.pg.mrp.srp.members.1.join_count=1
> runtime.totem.pg.mrp.srp.members.1.status=joined
> runtime.totem.pg.mrp.srp.members.2.ip=r(0) ip(10.14.18.66)
> runtime.totem.pg.mrp.srp.members.2.join_count=1
> runtime.totem.pg.mrp.srp.members.2.status=joined
> runtime.totem.pg.mrp.srp.members.4.ip=r(0) ip(10.14.18.68)
> runtime.totem.pg.mrp.srp.members.4.join_count=1
> runtime.totem.pg.mrp.srp.members.4.status=left
> runtime.totem.pg.mrp.srp.members.5.ip=r(0) ip(10.14.18.70)
> runtime.totem.pg.mrp.srp.members.5.join_count=1
> runtime.totem.pg.mrp.srp.members.5.status=joined
> runtime.totem.pg.mrp.srp.members.3.ip=r(0) ip(10.14.18.67)
> runtime.totem.pg.mrp.srp.members.3.join_count=3
> runtime.totem.pg.mrp.srp.members.3.status=joined
>
>
> cluster.conf:
>
> <?xml version="1.0"?>
> <cluster config_version="32" name="hv-1618-110-1">
> <fence_daemon clean_start="0"/>
> <cman transport="udpu" expected_votes="1"/>
Setting expected_votes to 1 in a six-node cluster is a serious
configuration error and needs to be changed. That is what is causing
the new node to fence the rest of the cluster.

Check that all of the nodes have the same cluster.conf file; any
difference between the one on the existing nodes and the new one will
prevent the new node from joining too.
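
A quick way to verify that is to compare what each node is actually
running with what is on disk, e.g. (a sketch, run on every node):

    cman_tool version                               # config version cman has loaded
    grep config_version /etc/cluster/cluster.conf   # version of the file on disk
    ccs_config_validate                             # sanity-check the local file

The version numbers should all match.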
Chrissie
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster