Hi all,

I have a problem whereby when I create a network split/partition (by dropping traffic with iptables), the victim node for some reason does not realise it has split from the network. If I split a cluster into two partitions that both contain multiple nodes, one with quorum and one without, then things function as expected; it just appears that a single node on its own cannot work out that it does not have quorum when it has no other nodes to talk to.

A single victim node seems to recognise that it can't form a cluster due to network issues, but the status is not reflected in the output from corosync-quorumtool, and cluster services (via pacemaker) still continue to run. However, the other nodes in the rest of the cluster do realise they have lost contact with a node, no longer have quorum and correctly shut down services.

When I block traffic on the victim node's eth0, the remaining nodes see that they cannot communicate with it and shut down:

# corosync-quorumtool -s
Version:          1.4.5
Nodes:            3
Ring ID:          696
Quorum type:      corosync_votequorum
Quorate:          No

Node votes:       1
Expected votes:   7
Highest expected: 7
Total votes:      3
Quorum:           4 Activity blocked
Flags:

However, the victim node still thinks everything is fine, and maintains a view of the cluster prior to the split:

# corosync-quorumtool -s
Version:          1.4.5
Nodes:            4
Ring ID:          716
Quorum type:      corosync_votequorum
Quorate:          Yes

Node votes:       1
Expected votes:   7
Highest expected: 7
Total votes:      4
Quorum:           4
Flags:            Quorate

It does, however, notice in the logs that it cannot now form a cluster, as the following message repeats constantly:

corosync [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.

I would expect at this point for it to be in its own network partition with a total of 1 vote, and to block activity. However, this does not seem to happen until just after it rejoins the cluster. When I unblock traffic and it rejoins, I see the victim finally realise it had lost quorum:

Sep 05 09:52:21 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 720: memb=1, new=0, lost=3
Sep 05 09:52:21 corosync [VOTEQ ] quorum lost, blocking activity
Sep 05 09:52:21 corosync [QUORUM] This node is within the non-primary component and will NOT provide any services.
Sep 05 09:52:21 corosync [QUORUM] Members[1]: 358898186

And a second or so later it regains quorum:

crmd: notice: ais_dispatch_message: Membership 736: quorum acquired

So my question is: why, when the victim realises it cannot form a cluster ("Totem is unable to form..."), does it not lose quorum, update the status reported by corosync-quorumtool, and shut down cluster services?

A configuration file example and the package versions/environment are listed below. I'm using the "udpu" transport as we need to avoid multicast in this environment; it will eventually be using a routed network. This behaviour also persists when I disable the pacemaker plugin and just test with corosync.
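For reference, the split is created on the victim with iptables rules along the lines of the following (a simplified sketch; the exact rules I use may differ slightly, but the net effect is that all traffic in and out of eth0 is dropped):

# iptables -A INPUT -i eth0 -j DROP
# iptables -A OUTPUT -o eth0 -j DROP

and the partition is healed again by removing the same rules:

# iptables -D INPUT -i eth0 -j DROP
# iptables -D OUTPUT -o eth0 -j DROP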
compatibility: whitetank

totem {
    version: 2
    secauth: off
    interface {
        member {
            memberaddr: 10.90.100.20
        }
        member {
            memberaddr: 10.90.100.21
        }
        ... more nodes snipped ...
        ringnumber: 0
        bindnetaddr: 10.90.100.20
        mcastport: 5405
    }
    transport: udpu
}

amf {
    mode: disabled
}

aisexec {
    user: root
    group: root
}

quorum {
    provider: corosync_votequorum
    expected_votes: 7
}

service {
    # Load the Pacemaker Cluster Resource Manager
    name: pacemaker
    ver: 0
}

Environment: CentOS 6.4, with packages from the openSUSE repository at
http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/RedHat_RHEL-6/x86_64/

# rpm -qa | egrep "^(cluster|corosync|crm|libqb|pacemaker|resource-agents)" | sort
cluster-glue-1.0.11-3.1.x86_64
cluster-glue-libs-1.0.11-3.1.x86_64
corosync-1.4.5-2.2.x86_64
corosynclib-1.4.5-2.2.x86_64
crmsh-1.2.6-0.rc3.3.1.x86_64
libqb0-0.14.4-1.2.x86_64
pacemaker-1.1.9-2.1.x86_64
pacemaker-cli-1.1.9-2.1.x86_64
pacemaker-cluster-libs-1.1.9-2.1.x86_64
pacemaker-libs-1.1.9-2.1.x86_64
resource-agents-3.9.5-3.1.x86_64

Regards,

-Mark

Mark Round
Senior Systems Administrator
NCC Group

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss