Alessandro Bono wrote:
>
> On 10/02/14 12:24, Jan Friesse wrote:
>> Alessandro Bono wrote:
>>> On 10/02/14 10:47, Jan Friesse wrote:
>>>> Alessandro,
>>>> can you find a message like "Corosync main process was not scheduled
>>>> for ... ms" in the log file (corosync must be at least 1.4.1-16, so
>>>> CentOS 6.5)?
>>>
>>> Hi
>>>
>>> there is no message like that in the log file
>>> the distro is CentOS 6.5
>>
>> Ok. So the first thing to try is to remove the redundant ring (just
>> remove the altname tags) and see if the problem still exists. If so,
>> give standard multicast a try (so remove udpu), but make sure to
>> enable multicast_querier (echo 1 >
>> /sys/class/net/$NETWORK_IFACE/bridge/multicast_querier; I'm using a
>> libvirt qemu hook (/etc/libvirt/hooks/qemu) for that).
>
> The redundant ring and udpu are an attempt to work around the problem;
> this library error was first seen on a configuration without these
> parameters. I have the same problem on two clusters with a similar
> node configuration.

Ok. Can you please then paste logs from a single-ring multicast
configuration (ideally CentOS 6.5)?

Regards,
  Honza

>>
>> Regards,
>>   Honza
>>
>>> rpm -qa corosync
>>> corosync-1.4.1-17.el6.x86_64
>>>
>>>> Regards,
>>>>   Honza
>>>>
>>>> Alessandro Bono wrote:
>>>>> Hi
>>>>>
>>>>> After changing the cluster from corosync to cman+corosync (switching
>>>>> from CentOS 6.3 to 6.4) I have a recurring problem with
>>>>> pacemaker/corosync. pacemaker reports this error
>>>>>
>>>>> pacemakerd: error: pcmk_cpg_dispatch: Connection to the CPG API
>>>>> failed: Library error (2)
>>>>>
>>>>> and then shuts itself down. This normally happens when the host
>>>>> machine is under high load, for example during a full backup.
>>>>>
>>>>> In addition, there are a lot of these messages:
>>>>>
>>>>> Feb 01 23:27:04 corosync [TOTEM ] received message requesting test of ring now active
>>>>> Feb 01 23:27:04 corosync [TOTEM ] Automatically recovered ring 1
>>>>> Feb 01 23:27:06 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY
>>>>> Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of ring now active
>>>>> Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
>>>>> Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of ring now active
>>>>> Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
>>>>> Feb 01 23:27:09 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
>>>>> Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of ring now active
>>>>> Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 1
>>>>> Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of ring now active
>>>>> Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 0
>>>>> Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
>>>>> Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY
>>>>> Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of ring now active
>>>>> Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of ring now active
>>>>>
>>>>> I reported this problem to the pacemaker mailing list, but they said
>>>>> it is a corosync problem. Same problem with CentOS 6.5.
>>>>>
>>>>> I tried switching communication to udpu and adding another
>>>>> communication path, but without any luck. The cluster nodes are KVM
>>>>> virtual machines.
>>>>>
>>>>> Is it a configuration problem?
>>>>>
>>>>> Some info below; I can provide the full log if necessary.
>>>>>
>>>>> rpm -qa | egrep "pacem|coro" | sort
>>>>> corosync-1.4.1-17.el6.x86_64
>>>>> corosynclib-1.4.1-17.el6.x86_64
>>>>> drbd-pacemaker-8.3.16-1.el6.x86_64
>>>>> pacemaker-1.1.10-14.el6_5.2.x86_64
>>>>> pacemaker-cli-1.1.10-14.el6_5.2.x86_64
>>>>> pacemaker-cluster-libs-1.1.10-14.el6_5.2.x86_64
>>>>> pacemaker-debuginfo-1.1.10-1.el6.x86_64
>>>>> pacemaker-libs-1.1.10-14.el6_5.2.x86_64
>>>>>
>>>>> cat /etc/cluster/cluster.conf
>>>>> <cluster config_version="8" name="ga-ext_cluster">
>>>>>   <cman transport="udpu"/>
>>>>>   <logging>
>>>>>     <logging_daemon name="corosync" debug="on"/>
>>>>>   </logging>
>>>>>   <clusternodes>
>>>>>     <clusternode name="ga1-ext" nodeid="1">
>>>>>       <fence>
>>>>>         <method name="pcmk-redirect">
>>>>>           <device name="pcmk" port="ga1-ext"/>
>>>>>         </method>
>>>>>       </fence>
>>>>>       <altname name="ga1-ext_alt"/>
>>>>>     </clusternode>
>>>>>     <clusternode name="ga2-ext" nodeid="2">
>>>>>       <fence>
>>>>>         <method name="pcmk-redirect">
>>>>>           <device name="pcmk" port="ga2-ext"/>
>>>>>         </method>
>>>>>       </fence>
>>>>>       <altname name="ga2-ext_alt"/>
>>>>>     </clusternode>
>>>>>   </clusternodes>
>>>>>   <fencedevices>
>>>>>     <fencedevice agent="fence_pcmk" name="pcmk"/>
>>>>>   </fencedevices>
>>>>> </cluster>
>>>>>
>>>>> crm configure show
>>>>> node ga1-ext \
>>>>>         attributes standby="off"
>>>>> node ga2-ext \
>>>>>         attributes standby="off"
>>>>> primitive ClusterIP ocf:heartbeat:IPaddr \
>>>>>         params ip="10.12.23.3" cidr_netmask="24" \
>>>>>         op monitor interval="30s"
>>>>> primitive SharedFS ocf:heartbeat:Filesystem \
>>>>>         params device="/dev/drbd/by-res/r0" directory="/shared" fstype="ext4" options="noatime,nobarrier"
>>>>> primitive dovecot lsb:dovecot
>>>>> primitive drbd0 ocf:linbit:drbd \
>>>>>         params drbd_resource="r0" \
>>>>>         op monitor interval="15s"
>>>>> primitive drbdlinks ocf:tummy:drbdlinks
>>>>> primitive mail ocf:heartbeat:MailTo \
>>>>>         params email="root@xxxxxxxxxxxxxxxxxxxx" subject="ga-ext cluster - "
>>>>> primitive mysql lsb:mysqld
>>>>> group service_group SharedFS drbdlinks ClusterIP mail mysql dovecot \
>>>>>         meta target-role="Started"
>>>>> ms ms_drbd0 drbd0 \
>>>>>         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
>>>>> colocation service_on_drbd inf: service_group ms_drbd0:Master
>>>>> order service_after_drbd inf: ms_drbd0:promote service_group:start
>>>>> property $id="cib-bootstrap-options" \
>>>>>         dc-version="1.1.10-14.el6_5.2-368c726" \
>>>>>         cluster-infrastructure="cman" \
>>>>>         expected-quorum-votes="2" \
>>>>>         stonith-enabled="false" \
>>>>>         no-quorum-policy="ignore" \
>>>>>         last-lrm-refresh="1391290945" \
>>>>>         maintenance-mode="false"
>>>>> rsc_defaults $id="rsc-options" \
>>>>>         resource-stickiness="100"
>>>>>
>>>>> An extract from cluster.log:
>>>>>
>>>>> Feb 01 21:40:15 corosync [MAIN ] Completed service synchronization, ready to provide service.
>>>>> Feb 01 21:40:15 corosync [TOTEM ] waiting_trans_ack changed to 0
>>>>> Feb 01 21:40:15 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
>>>>> Feb 01 21:40:15 [13253] ga1-ext cib: info: crm_cs_flush: Sent 4 CPG messages (0 remaining, last=48): OK (1)
>>>>> Feb 01 21:40:15 [13256] ga1-ext crmd: info: crm_cs_flush: Sent 3 CPG messages (0 remaining, last=24): OK (1)
>>>>> Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
>>>>> Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
>>>>> Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
>>>>> Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 0
>>>>> Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
>>>>> Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.3 -> 0.299.4 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='fail-count-drbd0']: No such device or address (rc=-6, origin=local/attrd/34, version=0.299.11)
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-mysql']: No such device or address (rc=-6, origin=local/attrd/35, version=0.299.11)
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-drbd0']: No such device or address (rc=-6, origin=local/attrd/36, version=0.299.11)
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.4 -> 0.299.5 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.5 -> 0.299.6 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.6 -> 0.299.7 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.7 -> 0.299.8 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.8 -> 0.299.9 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']: OK (rc=0, origin=local/attrd/37, version=0.299.11)
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/38, version=0.299.11)
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-ClusterIP']: No such device or address (rc=-6, origin=local/attrd/39, version=0.299.11)
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='probe_complete']: OK (rc=0, origin=local/attrd/40, version=0.299.11)
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/41, version=0.299.11)
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']: OK (rc=0, origin=local/attrd/42, version=0.299.11)
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/43, version=0.299.11)
>>>>> Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
>>>>> Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
>>>>> Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
>>>>> Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
>>>>> Feb 01 21:40:17 corosync [CMAN ] ais: deliver_fn source nodeid = 2, len=24, endian_conv=0
>>>>> Feb 01 21:40:17 corosync [CMAN ] memb: Message on port 0 is 6
>>>>> Feb 01 21:40:17 corosync [CMAN ] memb: got KILL for node 1
>>>>> Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
>>>>> Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
>>>>> Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
>>>>> Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=join_offer)
>>>>> Feb 01 21:40:17 [13256] ga1-ext crmd: info: do_state_transition: State transition S_INTEGRATION -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
>>>>> Feb 01 21:40:17 [13256] ga1-ext crmd: info: update_dc: Unset DC. Was ga1-ext
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.9 -> 0.299.10 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.10 -> 0.299.11 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
>>>>> Feb 01 21:40:18 [13247] ga1-ext pacemakerd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
>>>>> Feb 01 21:40:18 [13247] ga1-ext pacemakerd: error: mcp_cpg_destroy: Connection destroyed
>>>>> Feb 01 21:40:18 [13247] ga1-ext pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2
>>>>> Feb 01 21:40:18 [13255] ga1-ext attrd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
>>>>> Feb 01 21:40:18 [13255] ga1-ext attrd: crit: attrd_cs_destroy: Lost connection to Corosync service!
>>>>> Feb 01 21:40:18 [13255] ga1-ext attrd: notice: main: Exiting...
>>>>> Feb 01 21:40:18 [13255] ga1-ext attrd: notice: main: Disconnecting client 0x238ff10, pid=13256...
>>>>> Feb 01 21:40:18 [13255] ga1-ext attrd: error: attrd_cib_connection_destroy: Connection to the CIB terminated...
>>>>> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: stonith_shutdown: Terminating with 1 clients
>>>>> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: cib_connection_destroy: Connection to the CIB closed.
>>>>> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: crm_client_destroy: Destroying 0 events
>>>>> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: qb_ipcs_us_withdraw: withdrawing server sockets
>>>>> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: main: Done
>>>>> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: crm_xml_cleanup: Cleaning up memory from libxml2
>>>>> Feb 01 21:40:18 [13256] ga1-ext crmd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
>>>>> Feb 01 21:40:18 [13256] ga1-ext crmd: error: crmd_cs_destroy: connection terminated
>>>>> Feb 01 21:40:18 [13256] ga1-ext crmd: info: qb_ipcs_us_withdraw: withdrawing server sockets
>>>>> Feb 01 21:40:18 [13253] ga1-ext cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
>>>>> Feb 01 21:40:18 [13253] ga1-ext cib: error: cib_cs_destroy: Corosync connection lost! Exiting.
>>>>> Feb 01 21:40:18 [13253] ga1-ext cib: info: terminate_cib: cib_cs_destroy: Exiting fast...
>>>>> Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
>>>>> Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
>>>>> Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
>>>>> Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
>>>>> Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
>>>>> Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
>>>>> Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_xml_cleanup: Cleaning up memory from libxml2
>>>>> Feb 01 21:40:18 [13256] ga1-ext crmd: info: tengine_stonith_connection_destroy: Fencing daemon disconnected
>>>>> Feb 01 21:40:18 [13256] ga1-ext crmd: notice: crmd_exit: Forcing immediate exit: Link has been severed (67)
>>>>> Feb 01 21:40:18 [13256] ga1-ext crmd: info: crm_xml_cleanup: Cleaning up memory from libxml2
>>>>> Feb 01 21:40:18 [25258] ga1-ext lrmd: info: cancel_recurring_action: Cancelling operation ClusterIP_monitor_30000
>>>>> Feb 01 21:40:18 [25258] ga1-ext lrmd: warning: qb_ipcs_event_sendv: new_event_notification (25258-13256-6): Bad file descriptor (9)
>>>>> Feb 01 21:40:18 [25258] ga1-ext lrmd: warning: send_client_notify: Notification of client crmd/0b3ea733-7340-439c-9f46-81b0d7e1f6a1 failed
>>>>> Feb 01 21:40:18 [25258] ga1-ext lrmd: info: crm_client_destroy: Destroying 1 events
>>>>> Feb 01 21:40:18 [25260] ga1-ext pengine: info: crm_client_destroy: Destroying 0 events
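
For anyone wanting to try the multicast_querier workaround Honza
describes above, here is a minimal sketch of such a libvirt qemu hook,
assuming the cluster guests are attached to a Linux bridge named br0
(the bridge name is an assumption; adjust it to your setup). libvirt
runs /etc/libvirt/hooks/qemu, if it exists and is executable, passing
the guest name and the operation as the first two arguments:

#!/bin/sh
# /etc/libvirt/hooks/qemu -- invoked by libvirtd as:
#   qemu <guest_name> <operation> <sub-operation> ...
# Sketch only: br0 is an assumed bridge name.
GUEST="$1"       # guest name, unused here but handy for per-guest logic
OPERATION="$2"

if [ "$OPERATION" = "started" ]; then
    # Make the bridge act as an IGMP querier so that multicast group
    # membership is periodically refreshed and corosync multicast
    # traffic keeps flowing between the guests.
    echo 1 > /sys/class/net/br0/bridge/multicast_querier
fi

# Exit 0 unconditionally so a hook problem never blocks guest operations.
exit 0

The hook must be executable (chmod +x /etc/libvirt/hooks/qemu), and
libvirtd has to be restarted before it notices a newly created hook
script. The sysfs setting itself does not survive a host reboot, which
is why driving it from a hook is convenient; you can check the current
value with cat /sys/class/net/br0/bridge/multicast_querier.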