Alessandro,
actually everything is behaving exactly as it should. As can be seen
from the logged message "Feb 21 04:41:33 corosync [CMAN ] memb: Sending
KILL to node 2", cman on nespolo-ext is killing fico-mail's cman/corosync,
and this results in Library error (2) on fico-mail. This is perfectly
valid behaviour. Sorry I didn't notice this in the previous log.
There is nothing corosync can do about this problem.
I can only recommend putting the backups into cgroups and throttling
their I/O and CPU usage as much as possible, plus increasing the token
timeout. Together with properly configured fencing, the worst thing that
can happen is that from time to time (depending on how well you are able
to tune the token timeout) the fico-mail VM will be fenced, restarted,
and will then rejoin the cluster. Downtime should be minimal.
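For illustration only (the values here are guesses and would need to be
tuned against the actual pause lengths seen in the logs, e.g. the ~8.7 s
scheduling gap below; consensus should stay at least 1.2x token), the
token timeout could be raised in cluster.conf along these lines:

```xml
<!-- token must comfortably exceed the longest observed scheduling pause;
     consensus must be larger than token (typically >= 1.2 * token) -->
<totem token="10000" consensus="12000"/>
```

The backup job itself could then be confined with cgroups, e.g. run under
a group with reduced I/O and CPU weight (the exact controller names and
tools depend on the distribution and cgroup version).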
Regards,
Honza
Alessandro Bono wrote:
Hi Honza
attached log from another cluster
on primary node
grep Library corosync-fico-mail-20140221.log
Feb 21 04:41:33 [27122] fico-mail crmd: error:
pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Feb 21 04:41:33 [27120] fico-mail attrd: error:
pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Feb 21 04:41:33 [27118] fico-mail stonith-ng: error:
pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Feb 21 04:41:33 [27117] fico-mail cib: error:
pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Feb 21 04:41:35 [27111] fico-mail pacemakerd: info: crm_cs_flush:
Sent 0 CPG messages (1 remaining, last=10): Library error (2)
Feb 21 04:41:35 [27111] fico-mail pacemakerd: error:
pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
same story on secondary node
egrep "pause|scheduled" corosync-nespolo-ext-20140221.log
Feb 21 04:41:27 corosync [TOTEM ] Process pause detected for 5011 ms,
flushing membership messages.
Feb 21 04:41:27 corosync [MAIN ] Corosync main process was not
scheduled for 8759.2314 ms (threshold is 2400.0000 ms). Consider token
timeout increase.
Feb 21 04:41:38 corosync [TOTEM ] Process pause detected for 1955 ms,
flushing membership messages.
the secondary node is on an old and slow host used for backups, and it's
not easy to solve the performance problem
cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="1" name="mail_cluster">
<cman two_node="1" expected_votes="1"/>
<totem token="3000" consensus="5000" />
<logging>
<logging_daemon name="corosync" debug="on"/>
</logging>
<clusternodes>
<clusternode name="nespolo-ext" nodeid="1"/>
<clusternode name="fico-mail" nodeid="2"/>
</clusternodes>
</cluster>
crm configure show
node fico-mail
node nespolo-ext \
attributes standby="off"
primitive ClusterIP ocf:heartbeat:IPaddr \
params ip="10.153.24.4" cidr_netmask="24" \
op monitor interval="30s"
primitive SharedFS ocf:heartbeat:Filesystem \
params device="/dev/drbd/by-res/r0" directory="/shared" \
fstype="ext4" options="noatime,nobarrier"
primitive drbd0 ocf:linbit:drbd \
params drbd_resource="r0" \
op monitor interval="15s"
primitive drbdlinks ocf:tummy:drbdlinks \
meta target-role="Started"
primitive mysql lsb:mysqld
group service_group ClusterIP SharedFS drbdlinks mysql
ms ms_drbd0 drbd0 \
meta master-max="1" master-node-max="1" clone-max="2" \
clone-node-max="1" notify="true"
location prefer-master service_group 1: fico-mail
colocation service_on_drbd inf: service_group ms_drbd0:Master
order service_after_drbd inf: ms_drbd0:promote service_group:start
property $id="cib-bootstrap-options" \
dc-version="1.1.10-14.el6_5.2-368c726" \
cluster-infrastructure="cman" \
expected-quorum-votes="2" \
stonith-enabled="false" \
no-quorum-policy="ignore" \
last-lrm-refresh="1392973313" \
maintenance-mode="false"
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss