Re: pacemaker "CPG API: failed Library error"

Alessandro Bono <alessandro.bono@xxxxxxxxx> · Mon, 10 Feb 2014 14:41:48 +0100

On 10/02/14 13:55, Jan Friesse wrote:
Alessandro Bono wrote:
On 10/02/14 12:24, Jan Friesse wrote:
Alessandro Bono wrote:
On 10/02/14 10:47, Jan Friesse wrote:
Alessandro,
can you find a message like "Corosync main process was not scheduled for
... ms" in the log file (corosync must be at least 1.4.1-16, i.e. CentOS 6.5)?
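
A minimal sketch of how a log could be checked for that message. Only the message prefix comes from the question above; the function name and the assumption that a numeric duration in ms follows the prefix are illustrative, not taken from corosync itself.

```python
import re

# Prefix as quoted in the thread. The numeric duration after it is an
# assumption about the rest of the line and is parsed loosely.
PAUSE_RE = re.compile(
    r"Corosync main process was not scheduled for\s+([0-9]+(?:\.[0-9]+)?)\s*ms"
)


def find_scheduling_pauses(lines):
    """Return (line_number, pause_ms) for every log line reporting a pause."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = PAUSE_RE.search(line)
        if match:
            hits.append((number, float(match.group(1))))
    return hits
```

It accepts any iterable of lines, so an open file object for the cluster log can be passed directly. An empty result matches the "no such message" answer given below.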
Hi

There is no message like that in the log file.
The distro is CentOS 6.5.
OK. So the first thing to try is to remove the redundant ring (just remove
the altname tags) and see if the problem still exists. If so, try
standard multicast (so remove udpu), but make sure to enable
multicast_querier (echo 1 >
/sys/class/net/$NETWORK_IFACE/bridge/multicast_querier; I'm using a
libvirt qemu hook (/etc/libvirt/hooks/qemu) for that).
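
As a rough sketch, the same sysfs write can be done from a small helper. The function name, the `sysfs_root` parameter, and the bridge name in the usage note are hypothetical; how the helper would be wired into a libvirt hook is not shown in this thread and is left out.

```python
from pathlib import Path


def enable_multicast_querier(bridge, sysfs_root="/sys/class/net"):
    """Write "1" to <sysfs_root>/<bridge>/bridge/multicast_querier.

    Returns True if the value was written, or False if the file does not
    exist (for example because the interface is not a bridge).
    """
    knob = Path(sysfs_root) / bridge / "bridge" / "multicast_querier"
    if not knob.exists():
        return False
    # Same effect as: echo 1 > /sys/class/net/<bridge>/bridge/multicast_querier
    knob.write_text("1\n")
    return True
```

Calling `enable_multicast_querier("br0")` would need root privileges on the host; `br0` is only an example name for the bridge the cluster VMs are attached to.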
The redundant ring and udpu were an attempt to work around the problem;
this library error was first seen on a configuration without these
parameters.
I have the same problem on two clusters with similar node configurations.
OK. Can you then please paste logs from a single-ring multicast
configuration (ideally CentOS 6.5)?
I have to find some old logs on my tape backup, which is not easy right now.
I have a full debug file, but for CentOS 6.4; it's 164k, so I'll send it
to you off-list.

Regards,
   Honza

rpm -qa corosync
corosync-1.4.1-17.el6.x86_64

Alessandro Bono wrote:
Hi

After changing the cluster from corosync to cman+corosync (switching from
CentOS 6.3 to 6.4), I have a recurring problem with pacemaker/corosync.
Pacemaker reports this error:

pacemakerd:    error: pcmk_cpg_dispatch:     Connection to the CPG API failed: Library error (2)

and shuts itself down.
This normally happens when the host machine is under high load, for
example during a full backup.

In addition, there are a lot of these messages:

Feb 01 23:27:04 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:04 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 23:27:06 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY
Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 23:27:09 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY
Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of ring now active
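
To see how often each ring flaps, messages like these can be tallied per ring id. The following is only a sketch: the function names are made up for illustration, and the timestamp pattern assumes the "Feb 01 23:27:04" layout seen in this excerpt. It also joins lines that a mail client wrapped onto the next line.

```python
import re
from collections import Counter

TIMESTAMP_RE = re.compile(r"^[A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2} ")
FAULTY_RE = re.compile(r"Marking ringid (\d+) interface (\S+) FAULTY")
RECOVERED_RE = re.compile(r"Automatically recovered ring (\d+)")


def unwrap(lines):
    """Join mail-wrapped continuation lines onto the entry that started them."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if TIMESTAMP_RE.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def ring_flaps(lines):
    """Count FAULTY markings and automatic recoveries per ring id."""
    faulty, recovered = Counter(), Counter()
    for entry in unwrap(lines):
        match = FAULTY_RE.search(entry)
        if match:
            faulty[int(match.group(1))] += 1
        match = RECOVERED_RE.search(entry)
        if match:
            recovered[int(match.group(1))] += 1
    return faulty, recovered


# Two entries copied from the excerpt above, the first one still wrapped.
excerpt = """\
Feb 01 23:27:06 corosync [TOTEM ] Marking ringid 0 interface
10.12.32.1
FAULTY
Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
"""
faulty, recovered = ring_flaps(excerpt.splitlines())
print(dict(faulty), dict(recovered))  # {0: 1} {0: 1}
```

Comparing the counts with the times of the backup window would show whether both rings go FAULTY only under load.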

I reported this problem to the pacemaker mailing list, but they said it's
a corosync problem.
The same problem occurs with CentOS 6.5.

I tried switching communication to udpu and adding another communication
path, but without any luck.
The cluster nodes are KVM virtual machines.

Is it a configuration problem?

Some info below; I can provide the full log if necessary.

rpm -qa  | egrep "pacem|coro"| sort
corosync-1.4.1-17.el6.x86_64
corosynclib-1.4.1-17.el6.x86_64
drbd-pacemaker-8.3.16-1.el6.x86_64
pacemaker-1.1.10-14.el6_5.2.x86_64
pacemaker-cli-1.1.10-14.el6_5.2.x86_64
pacemaker-cluster-libs-1.1.10-14.el6_5.2.x86_64
pacemaker-debuginfo-1.1.10-1.el6.x86_64
pacemaker-libs-1.1.10-14.el6_5.2.x86_64

cat /etc/cluster/cluster.conf
<cluster config_version="8" name="ga-ext_cluster">
<cman transport="udpu"/>
     <logging>
      <logging_daemon name="corosync" debug="on"/>
     </logging>
     <clusternodes>
       <clusternode name="ga1-ext" nodeid="1">
         <fence>
           <method name="pcmk-redirect">
             <device name="pcmk" port="ga1-ext"/>
           </method>
         </fence>
         <altname name="ga1-ext_alt"/>
       </clusternode>
       <clusternode name="ga2-ext" nodeid="2">
         <fence>
           <method name="pcmk-redirect">
             <device name="pcmk" port="ga2-ext"/>
           </method>
         </fence>
         <altname name="ga2-ext_alt"/>
       </clusternode>
     </clusternodes>
     <fencedevices>
       <fencedevice agent="fence_pcmk" name="pcmk"/>
     </fencedevices>
</cluster>

crm configure show
node ga1-ext \
       attributes standby="off"
node ga2-ext \
       attributes standby="off"
primitive ClusterIP ocf:heartbeat:IPaddr \
       params ip="10.12.23.3" cidr_netmask="24" \
       op monitor interval="30s"
primitive SharedFS ocf:heartbeat:Filesystem \
       params device="/dev/drbd/by-res/r0" directory="/shared" fstype="ext4" options="noatime,nobarrier"
primitive dovecot lsb:dovecot
primitive drbd0 ocf:linbit:drbd \
       params drbd_resource="r0" \
       op monitor interval="15s"
primitive drbdlinks ocf:tummy:drbdlinks
primitive mail ocf:heartbeat:MailTo \
       params email="root@xxxxxxxxxxxxxxxxxxxx" subject="ga-ext cluster - "
primitive mysql lsb:mysqld
group service_group SharedFS drbdlinks ClusterIP mail mysql dovecot \
       meta target-role="Started"
ms ms_drbd0 drbd0 \
       meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation service_on_drbd inf: service_group ms_drbd0:Master
order service_after_drbd inf: ms_drbd0:promote service_group:start
property $id="cib-bootstrap-options" \
       dc-version="1.1.10-14.el6_5.2-368c726" \
       cluster-infrastructure="cman" \
       expected-quorum-votes="2" \
       stonith-enabled="false" \
       no-quorum-policy="ignore" \
       last-lrm-refresh="1391290945" \
       maintenance-mode="false"
rsc_defaults $id="rsc-options" \
       resource-stickiness="100"

extract from cluster.log

Feb 01 21:40:15 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Feb 01 21:40:15 corosync [TOTEM ] waiting_trans_ack changed to 0
Feb 01 21:40:15 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
Feb 01 21:40:15 [13253] ga1-ext        cib:     info: crm_cs_flush: Sent 4 CPG messages  (0 remaining, last=48): OK (1)
Feb 01 21:40:15 [13256] ga1-ext       crmd:     info: crm_cs_flush: Sent 3 CPG messages  (0 remaining, last=24): OK (1)
Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_diff:         Diff 0.299.3 -> 0.299.4 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_request:      Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='fail-count-drbd0']: No such device or address (rc=-6, origin=local/attrd/34, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_request:      Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-mysql']: No such device or address (rc=-6, origin=local/attrd/35, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_request:      Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-drbd0']: No such device or address (rc=-6, origin=local/attrd/36, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_diff:         Diff 0.299.4 -> 0.299.5 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_diff:         Diff 0.299.5 -> 0.299.6 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_diff:         Diff 0.299.6 -> 0.299.7 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_diff:         Diff 0.299.7 -> 0.299.8 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_diff:         Diff 0.299.8 -> 0.299.9 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_request:      Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']: OK (rc=0, origin=local/attrd/37, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_request:      Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/38, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_request:      Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-ClusterIP']: No such device or address (rc=-6, origin=local/attrd/39, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_request:      Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='probe_complete']: OK (rc=0, origin=local/attrd/40, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_request:      Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/41, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_request:      Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']: OK (rc=0, origin=local/attrd/42, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_request:      Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/43, version=0.299.11)
Feb 01 21:40:17 [13256] ga1-ext       crmd:     info: register_fsa_error_adv:   Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext       crmd:  warning: crmd_ha_msg_filter:       Another DC detected: ga2-ext (op=noop)
Feb 01 21:40:17 [13256] ga1-ext       crmd:     info: register_fsa_error_adv:   Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext       crmd:  warning: crmd_ha_msg_filter:       Another DC detected: ga2-ext (op=noop)
Feb 01 21:40:17 corosync [CMAN  ] ais: deliver_fn source nodeid = 2, len=24, endian_conv=0
Feb 01 21:40:17 corosync [CMAN  ] memb: Message on port 0 is 6
Feb 01 21:40:17 corosync [CMAN  ] memb: got KILL for node 1
Feb 01 21:40:17 [13256] ga1-ext       crmd:     info: register_fsa_error_adv:   Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext       crmd:  warning: crmd_ha_msg_filter:       Another DC detected: ga2-ext (op=noop)
Feb 01 21:40:17 [13256] ga1-ext       crmd:     info: register_fsa_error_adv:   Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext       crmd:  warning: crmd_ha_msg_filter:       Another DC detected: ga2-ext (op=join_offer)
Feb 01 21:40:17 [13256] ga1-ext       crmd:     info: do_state_transition:      State transition S_INTEGRATION -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
Feb 01 21:40:17 [13256] ga1-ext       crmd:     info: update_dc: Unset DC. Was ga1-ext
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_diff:         Diff 0.299.9 -> 0.299.10 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext        cib:     info: cib_process_diff:         Diff 0.299.10 -> 0.299.11 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:18 [13247] ga1-ext pacemakerd:    error: pcmk_cpg_dispatch:        Connection to the CPG API failed: Library error (2)
Feb 01 21:40:18 [13247] ga1-ext pacemakerd:    error: mcp_cpg_destroy: Connection destroyed
Feb 01 21:40:18 [13247] ga1-ext pacemakerd:     info: crm_xml_cleanup: Cleaning up memory from libxml2
Feb 01 21:40:18 [13255] ga1-ext      attrd:    error: pcmk_cpg_dispatch:        Connection to the CPG API failed: Library error (2)
Feb 01 21:40:18 [13255] ga1-ext      attrd:     crit: attrd_cs_destroy:         Lost connection to Corosync service!
Feb 01 21:40:18 [13255] ga1-ext      attrd:   notice: main: Exiting...
Feb 01 21:40:18 [13255] ga1-ext      attrd:   notice: main: Disconnecting client 0x238ff10, pid=13256...
Feb 01 21:40:18 [13255] ga1-ext      attrd:    error: attrd_cib_connection_destroy:     Connection to the CIB terminated...
Feb 01 21:40:18 [13254] ga1-ext stonith-ng:     info: stonith_shutdown:         Terminating with  1 clients
Feb 01 21:40:18 [13254] ga1-ext stonith-ng:     info: cib_connection_destroy:   Connection to the CIB closed.
Feb 01 21:40:18 [13254] ga1-ext stonith-ng:     info: crm_client_destroy:       Destroying 0 events
Feb 01 21:40:18 [13254] ga1-ext stonith-ng:     info: qb_ipcs_us_withdraw:      withdrawing server sockets
Feb 01 21:40:18 [13254] ga1-ext stonith-ng:     info: main:     Done
Feb 01 21:40:18 [13254] ga1-ext stonith-ng:     info: crm_xml_cleanup: Cleaning up memory from libxml2
Feb 01 21:40:18 [13256] ga1-ext       crmd:    error: pcmk_cpg_dispatch:        Connection to the CPG API failed: Library error (2)
Feb 01 21:40:18 [13256] ga1-ext       crmd:    error: crmd_cs_destroy: connection terminated
Feb 01 21:40:18 [13256] ga1-ext       crmd:     info: qb_ipcs_us_withdraw:      withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext        cib:    error: pcmk_cpg_dispatch:        Connection to the CPG API failed: Library error (2)
Feb 01 21:40:18 [13253] ga1-ext        cib:    error: cib_cs_destroy: Corosync connection lost!  Exiting.
Feb 01 21:40:18 [13253] ga1-ext        cib:     info: terminate_cib: cib_cs_destroy: Exiting fast...
Feb 01 21:40:18 [13253] ga1-ext        cib:     info: qb_ipcs_us_withdraw:      withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext        cib:     info: crm_client_destroy:       Destroying 0 events
Feb 01 21:40:18 [13253] ga1-ext        cib:     info: crm_client_destroy:       Destroying 0 events
Feb 01 21:40:18 [13253] ga1-ext        cib:     info: qb_ipcs_us_withdraw:      withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext        cib:     info: crm_client_destroy:       Destroying 0 events
Feb 01 21:40:18 [13253] ga1-ext        cib:     info: qb_ipcs_us_withdraw:      withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext        cib:     info: crm_xml_cleanup: Cleaning up memory from libxml2
Feb 01 21:40:18 [13256] ga1-ext       crmd:     info: tengine_stonith_connection_destroy:       Fencing daemon disconnected
Feb 01 21:40:18 [13256] ga1-ext       crmd:   notice: crmd_exit: Forcing immediate exit: Link has been severed (67)
Feb 01 21:40:18 [13256] ga1-ext       crmd:     info: crm_xml_cleanup: Cleaning up memory from libxml2
Feb 01 21:40:18 [25258] ga1-ext       lrmd:     info: cancel_recurring_action:  Cancelling operation ClusterIP_monitor_30000
Feb 01 21:40:18 [25258] ga1-ext       lrmd:  warning: qb_ipcs_event_sendv:      new_event_notification (25258-13256-6): Bad file descriptor (9)
Feb 01 21:40:18 [25258] ga1-ext       lrmd:  warning: send_client_notify:       Notification of client crmd/0b3ea733-7340-439c-9f46-81b0d7e1f6a1 failed
Feb 01 21:40:18 [25258] ga1-ext       lrmd:     info: crm_client_destroy:       Destroying 1 events
Feb 01 21:40:18 [25260] ga1-ext    pengine:     info: crm_client_destroy:       Destroying 0 events

--
Best regards

Alessandro Bono
