Hi,
After changing the cluster stack from corosync to cman+corosync (when
switching from CentOS 6.3 to 6.4), I have a recurring problem with
pacemaker/corosync.
Pacemaker reports this error:
pacemakerd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
and then shuts itself down.
This normally happens when the host machine is under high load, for
example during a full backup.
In addition, there are a lot of these messages:
Feb 01 23:27:04 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:04 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 23:27:06 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY
Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 23:27:09 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY
Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of ring now active
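To quantify how often each ring flaps, the TOTEM messages can be tallied per ring id. The following is only a rough sketch: the function name `summarize_ring_events` and the `SAMPLE` text are illustrative, and the regular expressions assume the exact message wording shown in the log excerpt.

```python
import re
from collections import Counter

# Patterns for the two corosync TOTEM messages seen in the log excerpt.
FAULTY_RE = re.compile(r"Marking ringid (\d+) interface (\S+) FAULTY")
RECOVERED_RE = re.compile(r"Automatically recovered ring (\d+)")


def summarize_ring_events(log_text):
    """Return (faulty, recovered) Counters keyed by ring id."""
    # Log lines may have been hard-wrapped in transit, so collapse all
    # whitespace before matching ("... 10.12.32.1\nFAULTY" still matches).
    flat = " ".join(log_text.split())
    faulty = Counter(int(m.group(1)) for m in FAULTY_RE.finditer(flat))
    recovered = Counter(int(m.group(1)) for m in RECOVERED_RE.finditer(flat))
    return faulty, recovered


# Illustrative sample taken from the excerpt (wrapped as in the mail).
SAMPLE = """\
Feb 01 23:27:04 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 23:27:06 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1
FAULTY
Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 23:27:09 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1
FAULTY
"""

if __name__ == "__main__":
    faulty, recovered = summarize_ring_events(SAMPLE)
    for ring in sorted(set(faulty) | set(recovered)):
        print(f"ring {ring}: faulty={faulty[ring]} recovered={recovered[ring]}")
```

Running it against the full cluster.log, instead of `SAMPLE`, would show whether one ring is marked FAULTY noticeably more often than the other.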
I reported this problem to the Pacemaker mailing list, but they said it
is a corosync problem.
The same problem occurs with CentOS 6.5.
I tried switching the communication transport to udpu and adding another
communication path, but without any luck.
The cluster nodes are KVM virtual machines.
Is it a configuration problem?
Some info is below; I can provide the full log if necessary.
rpm -qa | egrep "pacem|coro"| sort
corosync-1.4.1-17.el6.x86_64
corosynclib-1.4.1-17.el6.x86_64
drbd-pacemaker-8.3.16-1.el6.x86_64
pacemaker-1.1.10-14.el6_5.2.x86_64
pacemaker-cli-1.1.10-14.el6_5.2.x86_64
pacemaker-cluster-libs-1.1.10-14.el6_5.2.x86_64
pacemaker-debuginfo-1.1.10-1.el6.x86_64
pacemaker-libs-1.1.10-14.el6_5.2.x86_64
cat /etc/cluster/cluster.conf
<cluster config_version="8" name="ga-ext_cluster">
  <cman transport="udpu"/>
  <logging>
    <logging_daemon name="corosync" debug="on"/>
  </logging>
  <clusternodes>
    <clusternode name="ga1-ext" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="ga1-ext"/>
        </method>
      </fence>
      <altname name="ga1-ext_alt"/>
    </clusternode>
    <clusternode name="ga2-ext" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="ga2-ext"/>
        </method>
      </fence>
      <altname name="ga2-ext_alt"/>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_pcmk" name="pcmk"/>
  </fencedevices>
</cluster>
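As a quick sanity check of the two-ring setup, the primary and alternate name of each node can be read back from cluster.conf. This is a minimal sketch using only the Python standard library; the helper name `ring_names` and the trimmed `SAMPLE` document are illustrative, and it assumes the `clusternode`/`altname` layout shown above.

```python
import xml.etree.ElementTree as ET


def ring_names(cluster_conf_xml):
    """Map each clusternode name to its altname (or None if absent)."""
    root = ET.fromstring(cluster_conf_xml)
    names = {}
    for node in root.iter("clusternode"):
        alt = node.find("altname")
        names[node.get("name")] = alt.get("name") if alt is not None else None
    return names


# Trimmed copy of the configuration above, for illustration only.
SAMPLE = """\
<cluster config_version="8" name="ga-ext_cluster">
  <clusternodes>
    <clusternode name="ga1-ext" nodeid="1">
      <altname name="ga1-ext_alt"/>
    </clusternode>
    <clusternode name="ga2-ext" nodeid="2">
      <altname name="ga2-ext_alt"/>
    </clusternode>
  </clusternodes>
</cluster>
"""

if __name__ == "__main__":
    for primary, alternate in ring_names(SAMPLE).items():
        print(f"{primary}: ring 0 name={primary}, ring 1 name={alternate}")
```

Each returned name should resolve on every node to an address on the intended network; that resolution step is not covered by this sketch.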
crm configure show
node ga1-ext \
attributes standby="off"
node ga2-ext \
attributes standby="off"
primitive ClusterIP ocf:heartbeat:IPaddr \
params ip="10.12.23.3" cidr_netmask="24" \
op monitor interval="30s"
primitive SharedFS ocf:heartbeat:Filesystem \
params device="/dev/drbd/by-res/r0" directory="/shared" fstype="ext4" options="noatime,nobarrier"
primitive dovecot lsb:dovecot
primitive drbd0 ocf:linbit:drbd \
params drbd_resource="r0" \
op monitor interval="15s"
primitive drbdlinks ocf:tummy:drbdlinks
primitive mail ocf:heartbeat:MailTo \
params email="root@xxxxxxxxxxxxxxxxxxxx" subject="ga-ext cluster - "
primitive mysql lsb:mysqld
group service_group SharedFS drbdlinks ClusterIP mail mysql dovecot \
meta target-role="Started"
ms ms_drbd0 drbd0 \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation service_on_drbd inf: service_group ms_drbd0:Master
order service_after_drbd inf: ms_drbd0:promote service_group:start
property $id="cib-bootstrap-options" \
dc-version="1.1.10-14.el6_5.2-368c726" \
cluster-infrastructure="cman" \
expected-quorum-votes="2" \
stonith-enabled="false" \
no-quorum-policy="ignore" \
last-lrm-refresh="1391290945" \
maintenance-mode="false"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"
extract from cluster.log
Feb 01 21:40:15 corosync [MAIN ] Completed service synchronization, ready to provide service.
Feb 01 21:40:15 corosync [TOTEM ] waiting_trans_ack changed to 0
Feb 01 21:40:15 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
Feb 01 21:40:15 [13253] ga1-ext cib: info: crm_cs_flush: Sent 4 CPG messages (0 remaining, last=48): OK (1)
Feb 01 21:40:15 [13256] ga1-ext crmd: info: crm_cs_flush: Sent 3 CPG messages (0 remaining, last=24): OK (1)
Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.3 -> 0.299.4 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='fail-count-drbd0']: No such device or address (rc=-6, origin=local/attrd/34, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-mysql']: No such device or address (rc=-6, origin=local/attrd/35, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-drbd0']: No such device or address (rc=-6, origin=local/attrd/36, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.4 -> 0.299.5 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.5 -> 0.299.6 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.6 -> 0.299.7 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.7 -> 0.299.8 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.8 -> 0.299.9 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']: OK (rc=0, origin=local/attrd/37, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/38, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-ClusterIP']: No such device or address (rc=-6, origin=local/attrd/39, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='probe_complete']: OK (rc=0, origin=local/attrd/40, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/41, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']: OK (rc=0, origin=local/attrd/42, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/43, version=0.299.11)
Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
Feb 01 21:40:17 corosync [CMAN ] ais: deliver_fn source nodeid = 2, len=24, endian_conv=0
Feb 01 21:40:17 corosync [CMAN ] memb: Message on port 0 is 6
Feb 01 21:40:17 corosync [CMAN ] memb: got KILL for node 1
Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=join_offer)
Feb 01 21:40:17 [13256] ga1-ext crmd: info: do_state_transition: State transition S_INTEGRATION -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
Feb 01 21:40:17 [13256] ga1-ext crmd: info: update_dc: Unset DC. Was ga1-ext
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.9 -> 0.299.10 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.10 -> 0.299.11 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:18 [13247] ga1-ext pacemakerd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Feb 01 21:40:18 [13247] ga1-ext pacemakerd: error: mcp_cpg_destroy: Connection destroyed
Feb 01 21:40:18 [13247] ga1-ext pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2
Feb 01 21:40:18 [13255] ga1-ext attrd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Feb 01 21:40:18 [13255] ga1-ext attrd: crit: attrd_cs_destroy: Lost connection to Corosync service!
Feb 01 21:40:18 [13255] ga1-ext attrd: notice: main: Exiting...
Feb 01 21:40:18 [13255] ga1-ext attrd: notice: main: Disconnecting client 0x238ff10, pid=13256...
Feb 01 21:40:18 [13255] ga1-ext attrd: error: attrd_cib_connection_destroy: Connection to the CIB terminated...
Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: stonith_shutdown: Terminating with 1 clients
Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: cib_connection_destroy: Connection to the CIB closed.
Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: crm_client_destroy: Destroying 0 events
Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: qb_ipcs_us_withdraw: withdrawing server sockets
Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: main: Done
Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: crm_xml_cleanup: Cleaning up memory from libxml2
Feb 01 21:40:18 [13256] ga1-ext crmd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Feb 01 21:40:18 [13256] ga1-ext crmd: error: crmd_cs_destroy: connection terminated
Feb 01 21:40:18 [13256] ga1-ext crmd: info: qb_ipcs_us_withdraw: withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Feb 01 21:40:18 [13253] ga1-ext cib: error: cib_cs_destroy: Corosync connection lost! Exiting.
Feb 01 21:40:18 [13253] ga1-ext cib: info: terminate_cib: cib_cs_destroy: Exiting fast...
Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_xml_cleanup: Cleaning up memory from libxml2
Feb 01 21:40:18 [13256] ga1-ext crmd: info: tengine_stonith_connection_destroy: Fencing daemon disconnected
Feb 01 21:40:18 [13256] ga1-ext crmd: notice: crmd_exit: Forcing immediate exit: Link has been severed (67)
Feb 01 21:40:18 [13256] ga1-ext crmd: info: crm_xml_cleanup: Cleaning up memory from libxml2
Feb 01 21:40:18 [25258] ga1-ext lrmd: info: cancel_recurring_action: Cancelling operation ClusterIP_monitor_30000
Feb 01 21:40:18 [25258] ga1-ext lrmd: warning: qb_ipcs_event_sendv: new_event_notification (25258-13256-6): Bad file descriptor (9)
Feb 01 21:40:18 [25258] ga1-ext lrmd: warning: send_client_notify: Notification of client crmd/0b3ea733-7340-439c-9f46-81b0d7e1f6a1 failed
Feb 01 21:40:18 [25258] ga1-ext lrmd: info: crm_client_destroy: Destroying 1 events
Feb 01 21:40:18 [25260] ga1-ext pengine: info: crm_client_destroy: Destroying 0 events
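To see at a glance which Pacemaker daemons lost their CPG connection and when, the relevant error lines can be pulled out of the log. This is a small sketch only: `cpg_failures` and `SAMPLE` are illustrative names, and the pattern assumes the log line layout seen in this extract (timestamp, `[pid]`, host, daemon, severity).

```python
import re

# Timestamp, [pid], host, daemon, then the specific CPG failure message.
CPG_FAIL_RE = re.compile(
    r"(\w{3} \d{2} \d{2}:\d{2}:\d{2}) \[(\d+)\] \S+ (\S+): error: "
    r"pcmk_cpg_dispatch: Connection to the CPG API failed"
)


def cpg_failures(log_text):
    """Return a list of (timestamp, pid, daemon) for each CPG failure."""
    # Collapse whitespace so hard-wrapped log lines still match.
    flat = " ".join(log_text.split())
    return [(m.group(1), int(m.group(2)), m.group(3))
            for m in CPG_FAIL_RE.finditer(flat)]


# Illustrative sample taken from the extract (wrapped as in the mail).
SAMPLE = """\
Feb 01 21:40:18 [13247] ga1-ext pacemakerd: error:
pcmk_cpg_dispatch: Connection to the CPG API failed: Library
error (2)
Feb 01 21:40:18 [13247] ga1-ext pacemakerd: error: mcp_cpg_destroy:
Connection destroyed
Feb 01 21:40:18 [13255] ga1-ext attrd: error:
pcmk_cpg_dispatch: Connection to the CPG API failed: Library
error (2)
"""

if __name__ == "__main__":
    for timestamp, pid, daemon in cpg_failures(SAMPLE):
        print(f"{timestamp} {daemon} (pid {pid}) lost its CPG connection")
```

In this extract all four daemons (pacemakerd, attrd, crmd, cib) report the failure at 21:40:18, one second after the "got KILL for node 1" message from CMAN.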
--
Best regards
Alessandro Bono
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss