Re: pacemaker "CPG API: failed Library error"

Alessandro Bono wrote:
> On 10/02/14 10:47, Jan Friesse wrote:
>> Alessandro,
>> can you find a message like "Corosync main process was not scheduled for
>> ... ms" in the log file (corosync must be at least 1.4.1-16, so CentOS 6.5)?
> Hi,
> 
> there is no message like that in the log file.
> The distro is CentOS 6.5.

Ok. The first thing to try is to remove the redundant ring (just remove the
altname tags) and see if the problem still exists. If so, give standard
multicast a try (i.e. remove udpu), but make sure to enable
multicast_querier (echo 1 >
/sys/class/net/$NETWORK_IFACE/bridge/multicast_querier); I'm using a
libvirt qemu hook (/etc/libvirt/hooks/qemu) for that.
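Something along these lines can serve as the hook (just an illustrative
sketch, not my exact script; it assumes the guests are attached to a bridge
named br0, so adjust the interface name to your environment):

#!/bin/sh
# /etc/libvirt/hooks/qemu - illustrative sketch; bridge name "br0" is assumed.
# libvirt invokes the hook as: qemu <guest_name> <operation> <sub-operation> <extra>

BRIDGE="br0"
OPERATION="$2"

case "$OPERATION" in
  start|started|reconnect)
    # Make the bridge act as an IGMP querier so multicast group membership
    # (which corosync multicast relies on) keeps being refreshed.
    QUERIER="/sys/class/net/$BRIDGE/bridge/multicast_querier"
    [ -w "$QUERIER" ] && echo 1 > "$QUERIER"
    ;;
esac

exit 0

Don't forget to make the hook executable (chmod 755 /etc/libvirt/hooks/qemu)
and restart libvirtd once so it gets picked up.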

Regards,
  Honza

> 
> rpm -qa corosync
> corosync-1.4.1-17.el6.x86_64
> 
>>
>> Regards,
>>    Honza
>>
>> Alessandro Bono wrote:
>>> Hi
>>>
>>> after changing the cluster from corosync to cman+corosync (switching from
>>> CentOS 6.3 to 6.4) I have a recurring problem with pacemaker/corosync.
>>> pacemaker reports this error
>>>
>>> pacemakerd:    error: pcmk_cpg_dispatch:     Connection to the CPG API
>>> failed: Library error (2)
>>>
>>> and shuts itself down.
>>> This normally happens when the host machine is under high load, for
>>> example during a full backup.
>>>
>>> in addition, there are a lot of these messages
>>>
>>> Feb 01 23:27:04 corosync [TOTEM ] received message requesting test of
>>> ring now active
>>> Feb 01 23:27:04 corosync [TOTEM ] Automatically recovered ring 1
>>> Feb 01 23:27:06 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1
>>> FAULTY
>>> Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of
>>> ring now active
>>> Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
>>> Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of
>>> ring now active
>>> Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
>>> Feb 01 23:27:09 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1
>>> FAULTY
>>> Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of
>>> ring now active
>>> Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 1
>>> Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of
>>> ring now active
>>> Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 0
>>> Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1
>>> FAULTY
>>> Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1
>>> FAULTY
>>> Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of
>>> ring now active
>>> Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of
>>> ring now active
>>>
>>> I reported this problem to the pacemaker ML but they said it's a corosync
>>> problem.
>>> The same problem occurs with CentOS 6.5.
>>>
>>> I tried switching communication to udpu and adding another communication
>>> path, but without any luck.
>>> The cluster nodes are KVM virtual machines.
>>>
>>> Is it a configuration problem?
>>>
>>> Some info below; I can provide the full log if necessary.
>>>
>>> rpm -qa  | egrep "pacem|coro"| sort
>>> corosync-1.4.1-17.el6.x86_64
>>> corosynclib-1.4.1-17.el6.x86_64
>>> drbd-pacemaker-8.3.16-1.el6.x86_64
>>> pacemaker-1.1.10-14.el6_5.2.x86_64
>>> pacemaker-cli-1.1.10-14.el6_5.2.x86_64
>>> pacemaker-cluster-libs-1.1.10-14.el6_5.2.x86_64
>>> pacemaker-debuginfo-1.1.10-1.el6.x86_64
>>> pacemaker-libs-1.1.10-14.el6_5.2.x86_64
>>>
>>>
>>> cat /etc/cluster/cluster.conf
>>> <cluster config_version="8" name="ga-ext_cluster">
>>> <cman transport="udpu"/>
>>>    <logging>
>>>     <logging_daemon name="corosync" debug="on"/>
>>>    </logging>
>>>    <clusternodes>
>>>      <clusternode name="ga1-ext" nodeid="1">
>>>        <fence>
>>>          <method name="pcmk-redirect">
>>>            <device name="pcmk" port="ga1-ext"/>
>>>          </method>
>>>        </fence>
>>>        <altname name="ga1-ext_alt"/>
>>>      </clusternode>
>>>      <clusternode name="ga2-ext" nodeid="2">
>>>        <fence>
>>>          <method name="pcmk-redirect">
>>>            <device name="pcmk" port="ga2-ext"/>
>>>          </method>
>>>        </fence>
>>>        <altname name="ga2-ext_alt"/>
>>>      </clusternode>
>>>    </clusternodes>
>>>    <fencedevices>
>>>      <fencedevice agent="fence_pcmk" name="pcmk"/>
>>>    </fencedevices>
>>> </cluster>
>>>
>>> crm configure show
>>> node ga1-ext \
>>>      attributes standby="off"
>>> node ga2-ext \
>>>      attributes standby="off"
>>> primitive ClusterIP ocf:heartbeat:IPaddr \
>>>      params ip="10.12.23.3" cidr_netmask="24" \
>>>      op monitor interval="30s"
>>> primitive SharedFS ocf:heartbeat:Filesystem \
>>>      params device="/dev/drbd/by-res/r0" directory="/shared"
>>> fstype="ext4" options="noatime,nobarrier"
>>> primitive dovecot lsb:dovecot
>>> primitive drbd0 ocf:linbit:drbd \
>>>      params drbd_resource="r0" \
>>>      op monitor interval="15s"
>>> primitive drbdlinks ocf:tummy:drbdlinks
>>> primitive mail ocf:heartbeat:MailTo \
>>>      params email="root@xxxxxxxxxxxxxxxxxxxx" subject="ga-ext cluster
>>> - "
>>> primitive mysql lsb:mysqld
>>> group service_group SharedFS drbdlinks ClusterIP mail mysql dovecot \
>>>      meta target-role="Started"
>>> ms ms_drbd0 drbd0 \
>>>      meta master-max="1" master-node-max="1" clone-max="2"
>>> clone-node-max="1" notify="true"
>>> colocation service_on_drbd inf: service_group ms_drbd0:Master
>>> order service_after_drbd inf: ms_drbd0:promote service_group:start
>>> property $id="cib-bootstrap-options" \
>>>      dc-version="1.1.10-14.el6_5.2-368c726" \
>>>      cluster-infrastructure="cman" \
>>>      expected-quorum-votes="2" \
>>>      stonith-enabled="false" \
>>>      no-quorum-policy="ignore" \
>>>      last-lrm-refresh="1391290945" \
>>>      maintenance-mode="false"
>>> rsc_defaults $id="rsc-options" \
>>>      resource-stickiness="100"
>>>
>>> Extract from cluster.log:
>>>
>>> Feb 01 21:40:15 corosync [MAIN  ] Completed service synchronization,
>>> ready to provide service.
>>> Feb 01 21:40:15 corosync [TOTEM ] waiting_trans_ack changed to 0
>>> Feb 01 21:40:15 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1
>>> FAULTY
>>> Feb 01 21:40:15 [13253] ga1-ext        cib:     info: crm_cs_flush:
>>> Sent 4 CPG messages  (0 remaining, last=48): OK (1)
>>> Feb 01 21:40:15 [13256] ga1-ext       crmd:     info: crm_cs_flush:
>>> Sent 3 CPG messages  (0 remaining, last=24): OK (1)
>>> Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of
>>> ring now active
>>> Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of
>>> ring now active
>>> Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of
>>> ring now active
>>> Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 0
>>> Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
>>> Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
>>> Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
>>> cib_process_diff:         Diff 0.299.3 -> 0.299.4 from ga2-ext not
>>> applied to 0.299.11: current "num_updates" is greater than required
>>> Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
>>> cib_process_request:      Completed cib_query operation for section
>>> //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='fail-count-drbd0']:
>>> No such device or address (rc=-6, origin=local/attrd/34, version=0.299.11)
>>> Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
>>> cib_process_request:      Completed cib_query operation for section
>>> //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-mysql']:
>>> No such device or address (rc=-6, origin=local/attrd/35, version=0.299.11)
>>> Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
>>> cib_process_request:      Completed cib_query operation for section
>>> //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-drbd0']:
>>> No such device or address (rc=-6, origin=local/attrd/36, version=0.299.11)
>>> Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
>>> cib_process_diff:         Diff 0.299.4 -> 0.299.5 from ga2-ext not
>>> applied to 0.299.11: current "num_updates" is greater than required
>>> Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
>>> cib_process_diff:         Diff 0.299.5 -> 0.299.6 from ga2-ext not
>>> applied to 0.299.11: current "num_updates" is greater than required
>>> Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
>>> cib_process_diff:         Diff 0.299.6 -> 0.299.7 from ga2-ext not
>>> applied to 0.299.11: current "num_updates" is greater than required
>>> Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
>>> cib_process_diff:         Diff 0.299.7 -> 0.299.8 from ga2-ext not
>>> applied to 0.299.11: current "num_updates" is greater than required
>>> Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
>>> cib_process_diff:         Diff 0.299.8 -> 0.299.9 from ga2-ext not
>>> applied to 0.299.11: current "num_updates" is greater than required
>>> Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
>>> cib_process_request:      Completed cib_query operation for section
>>> //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']:
>>> OK (rc=0, origin=local/attrd/37, version=0.299.11)
>>> Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
>>> cib_process_request:      Completed cib_modify operation for section
>>> status: OK (rc=0, origin=local/attrd/38, version=0.299.11)
>>> Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
>>> cib_process_request:      Completed cib_query operation for section
>>> //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-ClusterIP']:
>>> No such device or address (rc=-6, origin=local/attrd/39, version=0.299.11)
>>> Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
>>> cib_process_request:      Completed cib_query operation for section
>>> //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='probe_complete']:
>>> OK (rc=0, origin=local/attrd/40, version=0.299.11)
>>> Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
>>> cib_process_request:      Completed cib_modify operation for section
>>> status: OK (rc=0, origin=local/attrd/41, version=0.299.11)
>>> Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
>>> cib_process_request:      Completed cib_query operation for section
>>> //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']:
>>> OK (rc=0, origin=local/attrd/42, version=0.299.11)
>>> Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
>>> cib_process_request:      Completed cib_modify operation for section
>>> status: OK (rc=0, origin=local/attrd/43, version=0.299.11)
>>> Feb 01 21:40:17 [13256] ga1-ext       crmd:     info:
>>> register_fsa_error_adv:   Resetting the current action list
>>> Feb 01 21:40:17 [13256] ga1-ext       crmd:  warning:
>>> crmd_ha_msg_filter:       Another DC detected: ga2-ext (op=noop)
>>> Feb 01 21:40:17 [13256] ga1-ext       crmd:     info:
>>> register_fsa_error_adv:   Resetting the current action list
>>> Feb 01 21:40:17 [13256] ga1-ext       crmd:  warning:
>>> crmd_ha_msg_filter:       Another DC detected: ga2-ext (op=noop)
>>> Feb 01 21:40:17 corosync [CMAN  ] ais: deliver_fn source nodeid = 2,
>>> len=24, endian_conv=0
>>> Feb 01 21:40:17 corosync [CMAN  ] memb: Message on port 0 is 6
>>> Feb 01 21:40:17 corosync [CMAN  ] memb: got KILL for node 1
>>> Feb 01 21:40:17 [13256] ga1-ext       crmd:     info:
>>> register_fsa_error_adv:   Resetting the current action list
>>> Feb 01 21:40:17 [13256] ga1-ext       crmd:  warning:
>>> crmd_ha_msg_filter:       Another DC detected: ga2-ext (op=noop)
>>> Feb 01 21:40:17 [13256] ga1-ext       crmd:     info:
>>> register_fsa_error_adv:   Resetting the current action list
>>> Feb 01 21:40:17 [13256] ga1-ext       crmd:  warning:
>>> crmd_ha_msg_filter:       Another DC detected: ga2-ext (op=join_offer)
>>> Feb 01 21:40:17 [13256] ga1-ext       crmd:     info:
>>> do_state_transition:      State transition S_INTEGRATION -> S_ELECTION [
>>> input=I_ELECTION cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
>>> Feb 01 21:40:17 [13256] ga1-ext       crmd:     info: update_dc:
>>> Unset DC. Was ga1-ext
>>> Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
>>> cib_process_diff:         Diff 0.299.9 -> 0.299.10 from ga2-ext not
>>> applied to 0.299.11: current "num_updates" is greater than required
>>> Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
>>> cib_process_diff:         Diff 0.299.10 -> 0.299.11 from ga2-ext not
>>> applied to 0.299.11: current "num_updates" is greater than required
>>> Feb 01 21:40:18 [13247] ga1-ext pacemakerd:    error:
>>> pcmk_cpg_dispatch:        Connection to the CPG API failed: Library
>>> error (2)
>>> Feb 01 21:40:18 [13247] ga1-ext pacemakerd:    error: mcp_cpg_destroy:
>>> Connection destroyed
>>> Feb 01 21:40:18 [13247] ga1-ext pacemakerd:     info: crm_xml_cleanup:
>>> Cleaning up memory from libxml2
>>> Feb 01 21:40:18 [13255] ga1-ext      attrd:    error:
>>> pcmk_cpg_dispatch:        Connection to the CPG API failed: Library
>>> error (2)
>>> Feb 01 21:40:18 [13255] ga1-ext      attrd:     crit:
>>> attrd_cs_destroy:         Lost connection to Corosync service!
>>> Feb 01 21:40:18 [13255] ga1-ext      attrd:   notice: main: Exiting...
>>> Feb 01 21:40:18 [13255] ga1-ext      attrd:   notice: main:
>>> Disconnecting client 0x238ff10, pid=13256...
>>> Feb 01 21:40:18 [13255] ga1-ext      attrd:    error:
>>> attrd_cib_connection_destroy:     Connection to the CIB terminated...
>>> Feb 01 21:40:18 [13254] ga1-ext stonith-ng:     info:
>>> stonith_shutdown:         Terminating with  1 clients
>>> Feb 01 21:40:18 [13254] ga1-ext stonith-ng:     info:
>>> cib_connection_destroy:   Connection to the CIB closed.
>>> Feb 01 21:40:18 [13254] ga1-ext stonith-ng:     info:
>>> crm_client_destroy:       Destroying 0 events
>>> Feb 01 21:40:18 [13254] ga1-ext stonith-ng:     info:
>>> qb_ipcs_us_withdraw:      withdrawing server sockets
>>> Feb 01 21:40:18 [13254] ga1-ext stonith-ng:     info: main:     Done
>>> Feb 01 21:40:18 [13254] ga1-ext stonith-ng:     info: crm_xml_cleanup:
>>> Cleaning up memory from libxml2
>>> Feb 01 21:40:18 [13256] ga1-ext       crmd:    error:
>>> pcmk_cpg_dispatch:        Connection to the CPG API failed: Library
>>> error (2)
>>> Feb 01 21:40:18 [13256] ga1-ext       crmd:    error: crmd_cs_destroy:
>>> connection terminated
>>> Feb 01 21:40:18 [13256] ga1-ext       crmd:     info:
>>> qb_ipcs_us_withdraw:      withdrawing server sockets
>>> Feb 01 21:40:18 [13253] ga1-ext        cib:    error:
>>> pcmk_cpg_dispatch:        Connection to the CPG API failed: Library
>>> error (2)
>>> Feb 01 21:40:18 [13253] ga1-ext        cib:    error: cib_cs_destroy:
>>> Corosync connection lost!  Exiting.
>>> Feb 01 21:40:18 [13253] ga1-ext        cib:     info: terminate_cib:
>>> cib_cs_destroy: Exiting fast...
>>> Feb 01 21:40:18 [13253] ga1-ext        cib:     info:
>>> qb_ipcs_us_withdraw:      withdrawing server sockets
>>> Feb 01 21:40:18 [13253] ga1-ext        cib:     info:
>>> crm_client_destroy:       Destroying 0 events
>>> Feb 01 21:40:18 [13253] ga1-ext        cib:     info:
>>> crm_client_destroy:       Destroying 0 events
>>> Feb 01 21:40:18 [13253] ga1-ext        cib:     info:
>>> qb_ipcs_us_withdraw:      withdrawing server sockets
>>> Feb 01 21:40:18 [13253] ga1-ext        cib:     info:
>>> crm_client_destroy:       Destroying 0 events
>>> Feb 01 21:40:18 [13253] ga1-ext        cib:     info:
>>> qb_ipcs_us_withdraw:      withdrawing server sockets
>>> Feb 01 21:40:18 [13253] ga1-ext        cib:     info: crm_xml_cleanup:
>>> Cleaning up memory from libxml2
>>> Feb 01 21:40:18 [13256] ga1-ext       crmd:     info:
>>> tengine_stonith_connection_destroy:       Fencing daemon disconnected
>>> Feb 01 21:40:18 [13256] ga1-ext       crmd:   notice: crmd_exit:
>>> Forcing immediate exit: Link has been severed (67)
>>> Feb 01 21:40:18 [13256] ga1-ext       crmd:     info: crm_xml_cleanup:
>>> Cleaning up memory from libxml2
>>> Feb 01 21:40:18 [25258] ga1-ext       lrmd:     info:
>>> cancel_recurring_action:  Cancelling operation ClusterIP_monitor_30000
>>> Feb 01 21:40:18 [25258] ga1-ext       lrmd:  warning:
>>> qb_ipcs_event_sendv:      new_event_notification (25258-13256-6): Bad
>>> file descriptor (9)
>>> Feb 01 21:40:18 [25258] ga1-ext       lrmd:  warning:
>>> send_client_notify:       Notification of client
>>> crmd/0b3ea733-7340-439c-9f46-81b0d7e1f6a1 failed
>>> Feb 01 21:40:18 [25258] ga1-ext       lrmd:     info:
>>> crm_client_destroy:       Destroying 1 events
>>> Feb 01 21:40:18 [25260] ga1-ext    pengine:     info:
>>> crm_client_destroy:       Destroying 0 events
>>>
> 

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss



