Alessandro, can you find a message like "Corosync main process was not scheduled for ... ms" in the log file (corosync must be at least 1.4.1-16, so CentOS 6.5)?

Regards,
  Honza

Alessandro Bono wrote:
> Hi
>
> After changing the cluster from corosync to cman+corosync (switching from
> CentOS 6.3 to 6.4) I have a recurring problem with pacemaker/corosync.
> Pacemaker reports this error:
>
> pacemakerd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
>
> and shuts itself down. This normally happens when the host machine is under
> high load, for example during a full backup.
>
> In addition, there are a lot of these messages:
>
> Feb 01 23:27:04 corosync [TOTEM ] received message requesting test of ring now active
> Feb 01 23:27:04 corosync [TOTEM ] Automatically recovered ring 1
> Feb 01 23:27:06 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY
> Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of ring now active
> Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
> Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of ring now active
> Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
> Feb 01 23:27:09 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
> Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of ring now active
> Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 1
> Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of ring now active
> Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 0
> Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
> Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY
> Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of ring now active
> Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of ring now active
>
> I reported this problem to the pacemaker ML, but they said it is a corosync
> problem. The same problem occurs with CentOS 6.5.
>
> I tried switching communication to udpu and adding another communication
> path, but without any luck. The cluster nodes are KVM virtual machines.
>
> Is it a configuration problem?
> Some info below; I can provide the full log if necessary.
>
> rpm -qa | egrep "pacem|coro" | sort
> corosync-1.4.1-17.el6.x86_64
> corosynclib-1.4.1-17.el6.x86_64
> drbd-pacemaker-8.3.16-1.el6.x86_64
> pacemaker-1.1.10-14.el6_5.2.x86_64
> pacemaker-cli-1.1.10-14.el6_5.2.x86_64
> pacemaker-cluster-libs-1.1.10-14.el6_5.2.x86_64
> pacemaker-debuginfo-1.1.10-1.el6.x86_64
> pacemaker-libs-1.1.10-14.el6_5.2.x86_64
>
> cat /etc/cluster/cluster.conf
> <cluster config_version="8" name="ga-ext_cluster">
>   <cman transport="udpu"/>
>   <logging>
>     <logging_daemon name="corosync" debug="on"/>
>   </logging>
>   <clusternodes>
>     <clusternode name="ga1-ext" nodeid="1">
>       <fence>
>         <method name="pcmk-redirect">
>           <device name="pcmk" port="ga1-ext"/>
>         </method>
>       </fence>
>       <altname name="ga1-ext_alt"/>
>     </clusternode>
>     <clusternode name="ga2-ext" nodeid="2">
>       <fence>
>         <method name="pcmk-redirect">
>           <device name="pcmk" port="ga2-ext"/>
>         </method>
>       </fence>
>       <altname name="ga2-ext_alt"/>
>     </clusternode>
>   </clusternodes>
>   <fencedevices>
>     <fencedevice agent="fence_pcmk" name="pcmk"/>
>   </fencedevices>
> </cluster>
>
> crm configure show
> node ga1-ext \
>     attributes standby="off"
> node ga2-ext \
>     attributes standby="off"
> primitive ClusterIP ocf:heartbeat:IPaddr \
>     params ip="10.12.23.3" cidr_netmask="24" \
>     op monitor interval="30s"
> primitive SharedFS ocf:heartbeat:Filesystem \
>     params device="/dev/drbd/by-res/r0" directory="/shared" fstype="ext4" options="noatime,nobarrier"
> primitive dovecot lsb:dovecot
> primitive drbd0 ocf:linbit:drbd \
>     params drbd_resource="r0" \
>     op monitor interval="15s"
> primitive drbdlinks ocf:tummy:drbdlinks
> primitive mail ocf:heartbeat:MailTo \
>     params email="root@xxxxxxxxxxxxxxxxxxxx" subject="ga-ext cluster - "
> primitive mysql lsb:mysqld
> group service_group SharedFS drbdlinks ClusterIP mail mysql dovecot \
>     meta target-role="Started"
> ms ms_drbd0 drbd0 \
>     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> colocation service_on_drbd inf: service_group ms_drbd0:Master
> order service_after_drbd inf: ms_drbd0:promote service_group:start
> property $id="cib-bootstrap-options" \
>     dc-version="1.1.10-14.el6_5.2-368c726" \
>     cluster-infrastructure="cman" \
>     expected-quorum-votes="2" \
>     stonith-enabled="false" \
>     no-quorum-policy="ignore" \
>     last-lrm-refresh="1391290945" \
>     maintenance-mode="false"
> rsc_defaults $id="rsc-options" \
>     resource-stickiness="100"
>
> Extract from cluster.log:
>
> Feb 01 21:40:15 corosync [MAIN ] Completed service synchronization, ready to provide service.
> Feb 01 21:40:15 corosync [TOTEM ] waiting_trans_ack changed to 0
> Feb 01 21:40:15 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
> Feb 01 21:40:15 [13253] ga1-ext cib: info: crm_cs_flush: Sent 4 CPG messages (0 remaining, last=48): OK (1)
> Feb 01 21:40:15 [13256] ga1-ext crmd: info: crm_cs_flush: Sent 3 CPG messages (0 remaining, last=24): OK (1)
> Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
> Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
> Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
> Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 0
> Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
> Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.3 -> 0.299.4 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='fail-count-drbd0']: No such device or address (rc=-6, origin=local/attrd/34, version=0.299.11)
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-mysql']: No such device or address (rc=-6, origin=local/attrd/35, version=0.299.11)
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-drbd0']: No such device or address (rc=-6, origin=local/attrd/36, version=0.299.11)
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.4 -> 0.299.5 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.5 -> 0.299.6 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.6 -> 0.299.7 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.7 -> 0.299.8 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.8 -> 0.299.9 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']: OK (rc=0, origin=local/attrd/37, version=0.299.11)
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/38, version=0.299.11)
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-ClusterIP']: No such device or address (rc=-6, origin=local/attrd/39, version=0.299.11)
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='probe_complete']: OK (rc=0, origin=local/attrd/40, version=0.299.11)
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/41, version=0.299.11)
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']: OK (rc=0, origin=local/attrd/42, version=0.299.11)
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/43, version=0.299.11)
> Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
> Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
> Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
> Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
> Feb 01 21:40:17 corosync [CMAN ] ais: deliver_fn source nodeid = 2, len=24, endian_conv=0
> Feb 01 21:40:17 corosync [CMAN ] memb: Message on port 0 is 6
> Feb 01 21:40:17 corosync [CMAN ] memb: got KILL for node 1
> Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
> Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
> Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
> Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=join_offer)
> Feb 01 21:40:17 [13256] ga1-ext crmd: info: do_state_transition: State transition S_INTEGRATION -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
> Feb 01 21:40:17 [13256] ga1-ext crmd: info: update_dc: Unset DC. Was ga1-ext
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.9 -> 0.299.10 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.10 -> 0.299.11 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
> Feb 01 21:40:18 [13247] ga1-ext pacemakerd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
> Feb 01 21:40:18 [13247] ga1-ext pacemakerd: error: mcp_cpg_destroy: Connection destroyed
> Feb 01 21:40:18 [13247] ga1-ext pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2
> Feb 01 21:40:18 [13255] ga1-ext attrd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
> Feb 01 21:40:18 [13255] ga1-ext attrd: crit: attrd_cs_destroy: Lost connection to Corosync service!
> Feb 01 21:40:18 [13255] ga1-ext attrd: notice: main: Exiting...
> Feb 01 21:40:18 [13255] ga1-ext attrd: notice: main: Disconnecting client 0x238ff10, pid=13256...
> Feb 01 21:40:18 [13255] ga1-ext attrd: error: attrd_cib_connection_destroy: Connection to the CIB terminated...
> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: stonith_shutdown: Terminating with 1 clients
> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: cib_connection_destroy: Connection to the CIB closed.
> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: crm_client_destroy: Destroying 0 events
> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: qb_ipcs_us_withdraw: withdrawing server sockets
> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: main: Done
> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: crm_xml_cleanup: Cleaning up memory from libxml2
> Feb 01 21:40:18 [13256] ga1-ext crmd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
> Feb 01 21:40:18 [13256] ga1-ext crmd: error: crmd_cs_destroy: connection terminated
> Feb 01 21:40:18 [13256] ga1-ext crmd: info: qb_ipcs_us_withdraw: withdrawing server sockets
> Feb 01 21:40:18 [13253] ga1-ext cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
> Feb 01 21:40:18 [13253] ga1-ext cib: error: cib_cs_destroy: Corosync connection lost! Exiting.
> Feb 01 21:40:18 [13253] ga1-ext cib: info: terminate_cib: cib_cs_destroy: Exiting fast...
> Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
> Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
> Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
> Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
> Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
> Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
> Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_xml_cleanup: Cleaning up memory from libxml2
> Feb 01 21:40:18 [13256] ga1-ext crmd: info: tengine_stonith_connection_destroy: Fencing daemon disconnected
> Feb 01 21:40:18 [13256] ga1-ext crmd: notice: crmd_exit: Forcing immediate exit: Link has been severed (67)
> Feb 01 21:40:18 [13256] ga1-ext crmd: info: crm_xml_cleanup: Cleaning up memory from libxml2
> Feb 01 21:40:18 [25258] ga1-ext lrmd: info: cancel_recurring_action: Cancelling operation ClusterIP_monitor_30000
> Feb 01 21:40:18 [25258] ga1-ext lrmd: warning: qb_ipcs_event_sendv: new_event_notification (25258-13256-6): Bad file descriptor (9)
> Feb 01 21:40:18 [25258] ga1-ext lrmd: warning: send_client_notify: Notification of client crmd/0b3ea733-7340-439c-9f46-81b0d7e1f6a1 failed
> Feb 01 21:40:18 [25258] ga1-ext lrmd: info: crm_client_destroy: Destroying 1 events
> Feb 01 21:40:18 [25260] ga1-ext pengine: info: crm_client_destroy: Destroying 0 events

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss
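A minimal sketch of the log check Honza asks for at the top of this thread, assuming the CMAN/corosync logs end up under /var/log/cluster/ on these CentOS 6 nodes (adjust the path to wherever cluster.log actually lives in your setup):

# search the cluster logs for the corosync scheduling warning Honza mentions
grep "Corosync main process was not scheduled" /var/log/cluster/*.log

If a match shows up with a pause of several seconds, that would line up with the symptom of the problem appearing only while the host is under heavy load (e.g. during the full backup).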