pacemaker "CPG API: failed Library error"

Hi

After changing the cluster from corosync to cman+corosync (when upgrading from CentOS 6.3 to 6.4), I have a recurring problem with pacemaker/corosync.
Pacemaker reports this error:

pacemakerd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)

and then shuts itself down.
This normally happens when the host machine is under high load, for example during a full backup.

In addition, there are a lot of these messages:

Feb 01 23:27:04 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:04 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 23:27:06 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY
Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 23:27:09 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY
Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of ring now active
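
To give an idea of how often each ring flaps, here is a minimal sketch in Python that counts the "Marking ringid ... FAULTY" messages per ring and interface. The function name `count_faulty` and the `sample` text are only illustrative; the regular expression assumes the message format shown in the log lines above.

```python
import re
from collections import Counter

# Matches corosync lines such as:
#   Feb 01 23:27:06 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY
FAULTY_RE = re.compile(r"Marking ringid (\d+) interface (\S+) FAULTY")


def count_faulty(log_text):
    """Return a Counter keyed by (ringid, interface) of FAULTY markings."""
    counts = Counter()
    # findall scans the whole text, so it also copes with several
    # log entries fused onto one physical line.
    for ringid, iface in FAULTY_RE.findall(log_text):
        counts[(int(ringid), iface)] += 1
    return counts


# Illustrative sample copied from the messages above.
sample = """\
Feb 01 23:27:06 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY
Feb 01 23:27:09 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY
"""

if __name__ == "__main__":
    for (ringid, iface), n in sorted(count_faulty(sample).items()):
        print(f"ring {ringid} ({iface}): {n} FAULTY markings")
```

The same function could be run on the contents of the full cluster.log to compare the two rings over a longer period.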

I reported this problem to the pacemaker mailing list, but they said it's a corosync problem.
The same problem occurs with CentOS 6.5.

I tried switching communication to udpu and adding another communication path, but without any luck.
The cluster nodes are KVM virtual machines.

Is it a configuration problem?

Some info below; I can provide the full log if necessary.

rpm -qa  | egrep "pacem|coro"| sort
corosync-1.4.1-17.el6.x86_64
corosynclib-1.4.1-17.el6.x86_64
drbd-pacemaker-8.3.16-1.el6.x86_64
pacemaker-1.1.10-14.el6_5.2.x86_64
pacemaker-cli-1.1.10-14.el6_5.2.x86_64
pacemaker-cluster-libs-1.1.10-14.el6_5.2.x86_64
pacemaker-debuginfo-1.1.10-1.el6.x86_64
pacemaker-libs-1.1.10-14.el6_5.2.x86_64


cat /etc/cluster/cluster.conf
<cluster config_version="8" name="ga-ext_cluster">
<cman transport="udpu"/>
  <logging>
   <logging_daemon name="corosync" debug="on"/>
  </logging>
  <clusternodes>
    <clusternode name="ga1-ext" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="ga1-ext"/>
        </method>
      </fence>
      <altname name="ga1-ext_alt"/>
    </clusternode>
    <clusternode name="ga2-ext" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="ga2-ext"/>
        </method>
      </fence>
      <altname name="ga2-ext_alt"/>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_pcmk" name="pcmk"/>
  </fencedevices>
</cluster>

crm configure show
node ga1-ext \
    attributes standby="off"
node ga2-ext \
    attributes standby="off"
primitive ClusterIP ocf:heartbeat:IPaddr \
    params ip="10.12.23.3" cidr_netmask="24" \
    op monitor interval="30s"
primitive SharedFS ocf:heartbeat:Filesystem \
    params device="/dev/drbd/by-res/r0" directory="/shared" fstype="ext4" options="noatime,nobarrier"
primitive dovecot lsb:dovecot
primitive drbd0 ocf:linbit:drbd \
    params drbd_resource="r0" \
    op monitor interval="15s"
primitive drbdlinks ocf:tummy:drbdlinks
primitive mail ocf:heartbeat:MailTo \
    params email="root@xxxxxxxxxxxxxxxxxxxx" subject="ga-ext cluster - "
primitive mysql lsb:mysqld
group service_group SharedFS drbdlinks ClusterIP mail mysql dovecot \
    meta target-role="Started"
ms ms_drbd0 drbd0 \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation service_on_drbd inf: service_group ms_drbd0:Master
order service_after_drbd inf: ms_drbd0:promote service_group:start
property $id="cib-bootstrap-options" \
    dc-version="1.1.10-14.el6_5.2-368c726" \
    cluster-infrastructure="cman" \
    expected-quorum-votes="2" \
    stonith-enabled="false" \
    no-quorum-policy="ignore" \
    last-lrm-refresh="1391290945" \
    maintenance-mode="false"
rsc_defaults $id="rsc-options" \
    resource-stickiness="100"

extract from cluster.log

Feb 01 21:40:15 corosync [MAIN ] Completed service synchronization, ready to provide service.
Feb 01 21:40:15 corosync [TOTEM ] waiting_trans_ack changed to 0
Feb 01 21:40:15 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
Feb 01 21:40:15 [13253] ga1-ext cib: info: crm_cs_flush: Sent 4 CPG messages (0 remaining, last=48): OK (1)
Feb 01 21:40:15 [13256] ga1-ext crmd: info: crm_cs_flush: Sent 3 CPG messages (0 remaining, last=24): OK (1)
Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.3 -> 0.299.4 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='fail-count-drbd0']: No such device or address (rc=-6, origin=local/attrd/34, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-mysql']: No such device or address (rc=-6, origin=local/attrd/35, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-drbd0']: No such device or address (rc=-6, origin=local/attrd/36, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.4 -> 0.299.5 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.5 -> 0.299.6 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.6 -> 0.299.7 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.7 -> 0.299.8 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.8 -> 0.299.9 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']: OK (rc=0, origin=local/attrd/37, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/38, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-ClusterIP']: No such device or address (rc=-6, origin=local/attrd/39, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='probe_complete']: OK (rc=0, origin=local/attrd/40, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/41, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']: OK (rc=0, origin=local/attrd/42, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/43, version=0.299.11)
Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
Feb 01 21:40:17 corosync [CMAN ] ais: deliver_fn source nodeid = 2, len=24, endian_conv=0
Feb 01 21:40:17 corosync [CMAN  ] memb: Message on port 0 is 6
Feb 01 21:40:17 corosync [CMAN  ] memb: got KILL for node 1
Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=join_offer)
Feb 01 21:40:17 [13256] ga1-ext crmd: info: do_state_transition: State transition S_INTEGRATION -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
Feb 01 21:40:17 [13256] ga1-ext crmd: info: update_dc: Unset DC. Was ga1-ext
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.9 -> 0.299.10 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.10 -> 0.299.11 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:18 [13247] ga1-ext pacemakerd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Feb 01 21:40:18 [13247] ga1-ext pacemakerd: error: mcp_cpg_destroy: Connection destroyed
Feb 01 21:40:18 [13247] ga1-ext pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2
Feb 01 21:40:18 [13255] ga1-ext attrd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Feb 01 21:40:18 [13255] ga1-ext attrd: crit: attrd_cs_destroy: Lost connection to Corosync service!
Feb 01 21:40:18 [13255] ga1-ext      attrd:   notice: main: Exiting...
Feb 01 21:40:18 [13255] ga1-ext attrd: notice: main: Disconnecting client 0x238ff10, pid=13256...
Feb 01 21:40:18 [13255] ga1-ext attrd: error: attrd_cib_connection_destroy: Connection to the CIB terminated...
Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: stonith_shutdown: Terminating with 1 clients
Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: cib_connection_destroy: Connection to the CIB closed.
Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: crm_client_destroy: Destroying 0 events
Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: qb_ipcs_us_withdraw: withdrawing server sockets
Feb 01 21:40:18 [13254] ga1-ext stonith-ng:     info: main:     Done
Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: crm_xml_cleanup: Cleaning up memory from libxml2
Feb 01 21:40:18 [13256] ga1-ext crmd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Feb 01 21:40:18 [13256] ga1-ext crmd: error: crmd_cs_destroy: connection terminated
Feb 01 21:40:18 [13256] ga1-ext crmd: info: qb_ipcs_us_withdraw: withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Feb 01 21:40:18 [13253] ga1-ext cib: error: cib_cs_destroy: Corosync connection lost! Exiting.
Feb 01 21:40:18 [13253] ga1-ext cib: info: terminate_cib: cib_cs_destroy: Exiting fast...
Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_xml_cleanup: Cleaning up memory from libxml2
Feb 01 21:40:18 [13256] ga1-ext crmd: info: tengine_stonith_connection_destroy: Fencing daemon disconnected
Feb 01 21:40:18 [13256] ga1-ext crmd: notice: crmd_exit: Forcing immediate exit: Link has been severed (67)
Feb 01 21:40:18 [13256] ga1-ext crmd: info: crm_xml_cleanup: Cleaning up memory from libxml2
Feb 01 21:40:18 [25258] ga1-ext lrmd: info: cancel_recurring_action: Cancelling operation ClusterIP_monitor_30000
Feb 01 21:40:18 [25258] ga1-ext lrmd: warning: qb_ipcs_event_sendv: new_event_notification (25258-13256-6): Bad file descriptor (9)
Feb 01 21:40:18 [25258] ga1-ext lrmd: warning: send_client_notify: Notification of client crmd/0b3ea733-7340-439c-9f46-81b0d7e1f6a1 failed
Feb 01 21:40:18 [25258] ga1-ext lrmd: info: crm_client_destroy: Destroying 1 events
Feb 01 21:40:18 [25260] ga1-ext pengine: info: crm_client_destroy: Destroying 0 events
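
To see which pacemaker daemons lost their CPG connection and when, the extract can be filtered with a small script. This is only a rough sketch in Python: the helper name `cpg_failures` and the `sample` text are illustrative, and the regular expression assumes the pacemaker log line layout shown in the extract (timestamp, pid in brackets, host, daemon, severity).

```python
import re

# Matches pacemaker lines such as:
#   Feb 01 21:40:18 [13247] ga1-ext pacemakerd: error: pcmk_cpg_dispatch:
#   Connection to the CPG API failed: Library error (2)
CPG_FAIL_RE = re.compile(
    r"(\w{3} \d{2} \d{2}:\d{2}:\d{2}) \[(\d+)\] \S+\s+(\S+):\s+error:\s+"
    r"pcmk_cpg_dispatch: Connection to the CPG API failed"
)


def cpg_failures(log_text):
    """Return a list of (timestamp, pid, daemon) for each CPG failure."""
    return [
        (timestamp, int(pid), daemon)
        for timestamp, pid, daemon in CPG_FAIL_RE.findall(log_text)
    ]


# Illustrative sample copied from the extract above.
sample = """\
Feb 01 21:40:18 [13247] ga1-ext pacemakerd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Feb 01 21:40:18 [13247] ga1-ext pacemakerd: error: mcp_cpg_destroy: Connection destroyed
Feb 01 21:40:18 [13255] ga1-ext attrd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
"""

if __name__ == "__main__":
    for timestamp, pid, daemon in cpg_failures(sample):
        print(f"{timestamp} {daemon} (pid {pid}) lost its CPG connection")
```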

--
Kind regards

Alessandro Bono

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss



