Re: pacemaker "CPG API: failed Library error"

On 10/02/14 10:47, Jan Friesse wrote:
Alessandro,
can you find a message like "Corosync main process was not scheduled for
... ms" in the log file (corosync must be at least 1.4.1-16, so CentOS 6.5)?
Hi,

There is no message like that in the log file.
The distro is CentOS 6.5.

rpm -qa corosync
corosync-1.4.1-17.el6.x86_64
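
A minimal sketch of the log check Honza asks for, written in Python (an assumption; the thread itself only uses shell commands). It matches just the fixed prefix of the quoted message, because the duration is elided in the question. The function names and the log path mentioned afterwards are illustrative assumptions, not something stated in the thread.

```python
import re

# Prefix of the message Honza asks about. The elided duration ("... ms")
# is deliberately not matched, since its exact format is not given here.
NOT_SCHEDULED_RE = re.compile(r"Corosync main process was not scheduled for")


def find_scheduling_warnings(lines):
    """Return (line_number, text) pairs for lines containing the message."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if NOT_SCHEDULED_RE.search(line)
    ]


def scan_log(path):
    """Scan a log file on disk; the path is supplied by the caller."""
    with open(path, errors="replace") as handle:
        return find_scheduling_warnings(handle)


# Two corosync lines quoted later in this thread (mail wrapping undone).
# Neither contains the message, so the result is an empty list.
sample = [
    "Feb 01 23:27:04 corosync [TOTEM ] Automatically recovered ring 1",
    "Feb 01 23:27:06 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY",
]
print(find_scheduling_warnings(sample))
```

To scan a real file, pass its location to scan_log(). /var/log/cluster/corosync.log is a plausible location on a cman-based CentOS 6 node, but that path is an assumption here.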


Regards,
   Honza

Alessandro Bono wrote:
Hi,

After changing the cluster from corosync to cman+corosync (while switching
from CentOS 6.3 to 6.4) I have a recurring problem with pacemaker/corosync.
Pacemaker reports this error:

pacemakerd:    error: pcmk_cpg_dispatch:     Connection to the CPG API
failed: Library error (2)

and shuts itself down.
This normally happens when the host machine is under high load, for
example during a full backup.

In addition, there are a lot of these messages:

Feb 01 23:27:04 corosync [TOTEM ] received message requesting test of
ring now active
Feb 01 23:27:04 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 23:27:06 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1
FAULTY
Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of
ring now active
Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of
ring now active
Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 23:27:09 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1
FAULTY
Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of
ring now active
Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of
ring now active
Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1
FAULTY
Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1
FAULTY
Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of
ring now active
Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of
ring now active
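
To see how often each ring flaps, the messages above can be tallied per ring id. The following is a rough sketch in Python, assuming the log lines are read unwrapped (one message per line, as they would appear in the log file rather than in this mail). The regular expressions and the function name are illustrative and based only on the message text shown above.

```python
import re
from collections import Counter

# Patterns derived from the corosync [TOTEM ] messages quoted above.
FAULTY_RE = re.compile(r"\[TOTEM \] Marking ringid (\d+) interface (\S+)\s+FAULTY")
RECOVERED_RE = re.compile(r"\[TOTEM \] Automatically recovered ring (\d+)")


def tally_ring_events(lines):
    """Count FAULTY markings and automatic recoveries per ring id."""
    faulty = Counter()
    recovered = Counter()
    for line in lines:
        match = FAULTY_RE.search(line)
        if match:
            faulty[int(match.group(1))] += 1
            continue
        match = RECOVERED_RE.search(line)
        if match:
            recovered[int(match.group(1))] += 1
    return faulty, recovered


# Four lines from the excerpt above, with the mail line-wrapping undone.
sample = [
    "Feb 01 23:27:04 corosync [TOTEM ] Automatically recovered ring 1",
    "Feb 01 23:27:06 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY",
    "Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0",
    "Feb 01 23:27:09 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY",
]
faulty, recovered = tally_ring_events(sample)
print("faulty:", dict(faulty))
print("recovered:", dict(recovered))
```

Comparing the counts for ring 0 and ring 1 over a longer window would show whether one interface is failing more often or whether both rings flap together.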

I reported this problem to the pacemaker mailing list, but they said it's
a corosync problem.
The same problem occurs with CentOS 6.5.

I tried switching communication to udpu and adding another communication
path, but without any luck.
The cluster nodes are KVM virtual machines.

Is it a configuration problem?

Some info is below; I can provide the full log if necessary.

rpm -qa  | egrep "pacem|coro"| sort
corosync-1.4.1-17.el6.x86_64
corosynclib-1.4.1-17.el6.x86_64
drbd-pacemaker-8.3.16-1.el6.x86_64
pacemaker-1.1.10-14.el6_5.2.x86_64
pacemaker-cli-1.1.10-14.el6_5.2.x86_64
pacemaker-cluster-libs-1.1.10-14.el6_5.2.x86_64
pacemaker-debuginfo-1.1.10-1.el6.x86_64
pacemaker-libs-1.1.10-14.el6_5.2.x86_64


cat /etc/cluster/cluster.conf
<cluster config_version="8" name="ga-ext_cluster">
<cman transport="udpu"/>
   <logging>
    <logging_daemon name="corosync" debug="on"/>
   </logging>
   <clusternodes>
     <clusternode name="ga1-ext" nodeid="1">
       <fence>
         <method name="pcmk-redirect">
           <device name="pcmk" port="ga1-ext"/>
         </method>
       </fence>
       <altname name="ga1-ext_alt"/>
     </clusternode>
     <clusternode name="ga2-ext" nodeid="2">
       <fence>
         <method name="pcmk-redirect">
           <device name="pcmk" port="ga2-ext"/>
         </method>
       </fence>
       <altname name="ga2-ext_alt"/>
     </clusternode>
   </clusternodes>
   <fencedevices>
     <fencedevice agent="fence_pcmk" name="pcmk"/>
   </fencedevices>
</cluster>

crm configure show
node ga1-ext \
     attributes standby="off"
node ga2-ext \
     attributes standby="off"
primitive ClusterIP ocf:heartbeat:IPaddr \
     params ip="10.12.23.3" cidr_netmask="24" \
     op monitor interval="30s"
primitive SharedFS ocf:heartbeat:Filesystem \
     params device="/dev/drbd/by-res/r0" directory="/shared" fstype="ext4" options="noatime,nobarrier"
primitive dovecot lsb:dovecot
primitive drbd0 ocf:linbit:drbd \
     params drbd_resource="r0" \
     op monitor interval="15s"
primitive drbdlinks ocf:tummy:drbdlinks
primitive mail ocf:heartbeat:MailTo \
     params email="root@xxxxxxxxxxxxxxxxxxxx" subject="ga-ext cluster - "
primitive mysql lsb:mysqld
group service_group SharedFS drbdlinks ClusterIP mail mysql dovecot \
     meta target-role="Started"
ms ms_drbd0 drbd0 \
     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation service_on_drbd inf: service_group ms_drbd0:Master
order service_after_drbd inf: ms_drbd0:promote service_group:start
property $id="cib-bootstrap-options" \
     dc-version="1.1.10-14.el6_5.2-368c726" \
     cluster-infrastructure="cman" \
     expected-quorum-votes="2" \
     stonith-enabled="false" \
     no-quorum-policy="ignore" \
     last-lrm-refresh="1391290945" \
     maintenance-mode="false"
rsc_defaults $id="rsc-options" \
     resource-stickiness="100"

Extract from cluster.log:

Feb 01 21:40:15 corosync [MAIN  ] Completed service synchronization,
ready to provide service.
Feb 01 21:40:15 corosync [TOTEM ] waiting_trans_ack changed to 0
Feb 01 21:40:15 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1
FAULTY
Feb 01 21:40:15 [13253] ga1-ext        cib:     info: crm_cs_flush:
Sent 4 CPG messages  (0 remaining, last=48): OK (1)
Feb 01 21:40:15 [13256] ga1-ext       crmd:     info: crm_cs_flush:
Sent 3 CPG messages  (0 remaining, last=24): OK (1)
Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of
ring now active
Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of
ring now active
Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of
ring now active
Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
cib_process_diff:         Diff 0.299.3 -> 0.299.4 from ga2-ext not
applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
cib_process_request:      Completed cib_query operation for section
//cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='fail-count-drbd0']: No such device or address (rc=-6,
origin=local/attrd/34, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
cib_process_request:      Completed cib_query operation for section
//cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-mysql']: No such device or address (rc=-6,
origin=local/attrd/35, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
cib_process_request:      Completed cib_query operation for section
//cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-drbd0']: No such device or address (rc=-6,
origin=local/attrd/36, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
cib_process_diff:         Diff 0.299.4 -> 0.299.5 from ga2-ext not
applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
cib_process_diff:         Diff 0.299.5 -> 0.299.6 from ga2-ext not
applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
cib_process_diff:         Diff 0.299.6 -> 0.299.7 from ga2-ext not
applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
cib_process_diff:         Diff 0.299.7 -> 0.299.8 from ga2-ext not
applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
cib_process_diff:         Diff 0.299.8 -> 0.299.9 from ga2-ext not
applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
cib_process_request:      Completed cib_query operation for section
//cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']: OK (rc=0, origin=local/attrd/37, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
cib_process_request:      Completed cib_modify operation for section
status: OK (rc=0, origin=local/attrd/38, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
cib_process_request:      Completed cib_query operation for section
//cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-ClusterIP']: No such device or address (rc=-6,
origin=local/attrd/39, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
cib_process_request:      Completed cib_query operation for section
//cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='probe_complete']: OK (rc=0, origin=local/attrd/40, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
cib_process_request:      Completed cib_modify operation for section
status: OK (rc=0, origin=local/attrd/41, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
cib_process_request:      Completed cib_query operation for section
//cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']: OK (rc=0, origin=local/attrd/42, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
cib_process_request:      Completed cib_modify operation for section
status: OK (rc=0, origin=local/attrd/43, version=0.299.11)
Feb 01 21:40:17 [13256] ga1-ext       crmd:     info:
register_fsa_error_adv:   Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext       crmd:  warning:
crmd_ha_msg_filter:       Another DC detected: ga2-ext (op=noop)
Feb 01 21:40:17 [13256] ga1-ext       crmd:     info:
register_fsa_error_adv:   Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext       crmd:  warning:
crmd_ha_msg_filter:       Another DC detected: ga2-ext (op=noop)
Feb 01 21:40:17 corosync [CMAN  ] ais: deliver_fn source nodeid = 2,
len=24, endian_conv=0
Feb 01 21:40:17 corosync [CMAN  ] memb: Message on port 0 is 6
Feb 01 21:40:17 corosync [CMAN  ] memb: got KILL for node 1
Feb 01 21:40:17 [13256] ga1-ext       crmd:     info:
register_fsa_error_adv:   Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext       crmd:  warning:
crmd_ha_msg_filter:       Another DC detected: ga2-ext (op=noop)
Feb 01 21:40:17 [13256] ga1-ext       crmd:     info:
register_fsa_error_adv:   Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext       crmd:  warning:
crmd_ha_msg_filter:       Another DC detected: ga2-ext (op=join_offer)
Feb 01 21:40:17 [13256] ga1-ext       crmd:     info:
do_state_transition:      State transition S_INTEGRATION -> S_ELECTION [
input=I_ELECTION cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
Feb 01 21:40:17 [13256] ga1-ext       crmd:     info: update_dc:
Unset DC. Was ga1-ext
Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
cib_process_diff:         Diff 0.299.9 -> 0.299.10 from ga2-ext not
applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext        cib:     info:
cib_process_diff:         Diff 0.299.10 -> 0.299.11 from ga2-ext not
applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:18 [13247] ga1-ext pacemakerd:    error:
pcmk_cpg_dispatch:        Connection to the CPG API failed: Library
error (2)
Feb 01 21:40:18 [13247] ga1-ext pacemakerd:    error: mcp_cpg_destroy:
Connection destroyed
Feb 01 21:40:18 [13247] ga1-ext pacemakerd:     info: crm_xml_cleanup:
Cleaning up memory from libxml2
Feb 01 21:40:18 [13255] ga1-ext      attrd:    error:
pcmk_cpg_dispatch:        Connection to the CPG API failed: Library
error (2)
Feb 01 21:40:18 [13255] ga1-ext      attrd:     crit:
attrd_cs_destroy:         Lost connection to Corosync service!
Feb 01 21:40:18 [13255] ga1-ext      attrd:   notice: main: Exiting...
Feb 01 21:40:18 [13255] ga1-ext      attrd:   notice: main:
Disconnecting client 0x238ff10, pid=13256...
Feb 01 21:40:18 [13255] ga1-ext      attrd:    error:
attrd_cib_connection_destroy:     Connection to the CIB terminated...
Feb 01 21:40:18 [13254] ga1-ext stonith-ng:     info:
stonith_shutdown:         Terminating with  1 clients
Feb 01 21:40:18 [13254] ga1-ext stonith-ng:     info:
cib_connection_destroy:   Connection to the CIB closed.
Feb 01 21:40:18 [13254] ga1-ext stonith-ng:     info:
crm_client_destroy:       Destroying 0 events
Feb 01 21:40:18 [13254] ga1-ext stonith-ng:     info:
qb_ipcs_us_withdraw:      withdrawing server sockets
Feb 01 21:40:18 [13254] ga1-ext stonith-ng:     info: main:     Done
Feb 01 21:40:18 [13254] ga1-ext stonith-ng:     info: crm_xml_cleanup:
Cleaning up memory from libxml2
Feb 01 21:40:18 [13256] ga1-ext       crmd:    error:
pcmk_cpg_dispatch:        Connection to the CPG API failed: Library
error (2)
Feb 01 21:40:18 [13256] ga1-ext       crmd:    error: crmd_cs_destroy:
connection terminated
Feb 01 21:40:18 [13256] ga1-ext       crmd:     info:
qb_ipcs_us_withdraw:      withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext        cib:    error:
pcmk_cpg_dispatch:        Connection to the CPG API failed: Library
error (2)
Feb 01 21:40:18 [13253] ga1-ext        cib:    error: cib_cs_destroy:
Corosync connection lost!  Exiting.
Feb 01 21:40:18 [13253] ga1-ext        cib:     info: terminate_cib:
cib_cs_destroy: Exiting fast...
Feb 01 21:40:18 [13253] ga1-ext        cib:     info:
qb_ipcs_us_withdraw:      withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext        cib:     info:
crm_client_destroy:       Destroying 0 events
Feb 01 21:40:18 [13253] ga1-ext        cib:     info:
crm_client_destroy:       Destroying 0 events
Feb 01 21:40:18 [13253] ga1-ext        cib:     info:
qb_ipcs_us_withdraw:      withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext        cib:     info:
crm_client_destroy:       Destroying 0 events
Feb 01 21:40:18 [13253] ga1-ext        cib:     info:
qb_ipcs_us_withdraw:      withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext        cib:     info: crm_xml_cleanup:
Cleaning up memory from libxml2
Feb 01 21:40:18 [13256] ga1-ext       crmd:     info:
tengine_stonith_connection_destroy:       Fencing daemon disconnected
Feb 01 21:40:18 [13256] ga1-ext       crmd:   notice: crmd_exit:
Forcing immediate exit: Link has been severed (67)
Feb 01 21:40:18 [13256] ga1-ext       crmd:     info: crm_xml_cleanup:
Cleaning up memory from libxml2
Feb 01 21:40:18 [25258] ga1-ext       lrmd:     info:
cancel_recurring_action:  Cancelling operation ClusterIP_monitor_30000
Feb 01 21:40:18 [25258] ga1-ext       lrmd:  warning:
qb_ipcs_event_sendv:      new_event_notification (25258-13256-6): Bad
file descriptor (9)
Feb 01 21:40:18 [25258] ga1-ext       lrmd:  warning:
send_client_notify:       Notification of client
crmd/0b3ea733-7340-439c-9f46-81b0d7e1f6a1 failed
Feb 01 21:40:18 [25258] ga1-ext       lrmd:     info:
crm_client_destroy:       Destroying 1 events
Feb 01 21:40:18 [25260] ga1-ext    pengine:     info:
crm_client_destroy:       Destroying 0 events
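
The extract above shows several Pacemaker daemons logging the same CPG failure within one second. A small sketch, assuming Python and the log layout visible in the extract (date, [pid], host, daemon, level, function), lists which daemons reported it and in what order. The regular expression and function name are illustrative assumptions; the sample lines are the four pcmk_cpg_dispatch errors from the extract with the mail line-wrapping undone.

```python
import re

# Layout as seen in the extract:
#   "<date> [<pid>] <host> <daemon>: <level>: <function>: <message>"
CPG_ERROR_RE = re.compile(
    r"\[(\d+)\]\s+\S+\s+([\w-]+):\s+error:\s+pcmk_cpg_dispatch:"
)


def daemons_losing_cpg(lines):
    """Return (daemon, pid) pairs for each pcmk_cpg_dispatch error, in log order."""
    hits = []
    for line in lines:
        match = CPG_ERROR_RE.search(line)
        if match:
            hits.append((match.group(2), int(match.group(1))))
    return hits


sample = [
    "Feb 01 21:40:18 [13247] ga1-ext pacemakerd:    error: pcmk_cpg_dispatch:        Connection to the CPG API failed: Library error (2)",
    "Feb 01 21:40:18 [13255] ga1-ext      attrd:    error: pcmk_cpg_dispatch:        Connection to the CPG API failed: Library error (2)",
    "Feb 01 21:40:18 [13256] ga1-ext       crmd:    error: pcmk_cpg_dispatch:        Connection to the CPG API failed: Library error (2)",
    "Feb 01 21:40:18 [13253] ga1-ext        cib:    error: pcmk_cpg_dispatch:        Connection to the CPG API failed: Library error (2)",
]
for daemon, pid in daemons_losing_cpg(sample):
    print(daemon, pid)
```

All four daemons losing the connection at once is consistent with the failure being on the corosync side rather than in any single Pacemaker process.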


--
Kind regards

Alessandro Bono

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss



