Alessandro, can you find a message like "Corosync main process was not scheduled for ... ms" in the log file (corosync must be at least 1.4.1-16, so CentOS 6.5)?

Regards,
  Honza

Alessandro Bono wrote:
> Hi
>
> After changing the cluster from corosync to cman+corosync (switching from
> CentOS 6.3 to 6.4) I have a recurring problem with pacemaker/corosync.
> Pacemaker reports this error:
>
> pacemakerd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
>
> and shuts itself down. This normally happens when the host machine is under
> high load, for example during a full backup.
>
> In addition, there are a lot of these messages:
>
> Feb 01 23:27:04 corosync [TOTEM ] received message requesting test of ring now active
> Feb 01 23:27:04 corosync [TOTEM ] Automatically recovered ring 1
> Feb 01 23:27:06 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY
> Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of ring now active
> Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
> Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of ring now active
> Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
> Feb 01 23:27:09 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
> Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of ring now active
> Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 1
> Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of ring now active
> Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 0
> Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
> Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY
> Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of ring now active
> Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of ring now active
>
> I reported this problem to the pacemaker ML, but they said it is a corosync
> problem. The same problem occurs with CentOS 6.5.
>
> I tried switching communication to udpu and adding another communication
> path, but without any luck. The cluster nodes are KVM virtual machines.
>
> Is it a configuration problem?
> Some info below; I can provide the full log if necessary.
>
> rpm -qa | egrep "pacem|coro" | sort
> corosync-1.4.1-17.el6.x86_64
> corosynclib-1.4.1-17.el6.x86_64
> drbd-pacemaker-8.3.16-1.el6.x86_64
> pacemaker-1.1.10-14.el6_5.2.x86_64
> pacemaker-cli-1.1.10-14.el6_5.2.x86_64
> pacemaker-cluster-libs-1.1.10-14.el6_5.2.x86_64
> pacemaker-debuginfo-1.1.10-1.el6.x86_64
> pacemaker-libs-1.1.10-14.el6_5.2.x86_64
>
> cat /etc/cluster/cluster.conf
> <cluster config_version="8" name="ga-ext_cluster">
>   <cman transport="udpu"/>
>   <logging>
>     <logging_daemon name="corosync" debug="on"/>
>   </logging>
>   <clusternodes>
>     <clusternode name="ga1-ext" nodeid="1">
>       <fence>
>         <method name="pcmk-redirect">
>           <device name="pcmk" port="ga1-ext"/>
>         </method>
>       </fence>
>       <altname name="ga1-ext_alt"/>
>     </clusternode>
>     <clusternode name="ga2-ext" nodeid="2">
>       <fence>
>         <method name="pcmk-redirect">
>           <device name="pcmk" port="ga2-ext"/>
>         </method>
>       </fence>
>       <altname name="ga2-ext_alt"/>
>     </clusternode>
>   </clusternodes>
>   <fencedevices>
>     <fencedevice agent="fence_pcmk" name="pcmk"/>
>   </fencedevices>
> </cluster>
>
> crm configure show
> node ga1-ext \
>     attributes standby="off"
> node ga2-ext \
>     attributes standby="off"
> primitive ClusterIP ocf:heartbeat:IPaddr \
>     params ip="10.12.23.3" cidr_netmask="24" \
>     op monitor interval="30s"
> primitive SharedFS ocf:heartbeat:Filesystem \
>     params device="/dev/drbd/by-res/r0" directory="/shared" fstype="ext4" options="noatime,nobarrier"
> primitive dovecot lsb:dovecot
> primitive drbd0 ocf:linbit:drbd \
>     params drbd_resource="r0" \
>     op monitor interval="15s"
> primitive drbdlinks ocf:tummy:drbdlinks
> primitive mail ocf:heartbeat:MailTo \
>     params email="root@xxxxxxxxxxxxxxxxxxxx" subject="ga-ext cluster - "
> primitive mysql lsb:mysqld
> group service_group SharedFS drbdlinks ClusterIP mail mysql dovecot \
>     meta target-role="Started"
> ms ms_drbd0 drbd0 \
>     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> colocation service_on_drbd inf: service_group ms_drbd0:Master
> order service_after_drbd inf: ms_drbd0:promote service_group:start
> property $id="cib-bootstrap-options" \
>     dc-version="1.1.10-14.el6_5.2-368c726" \
>     cluster-infrastructure="cman" \
>     expected-quorum-votes="2" \
>     stonith-enabled="false" \
>     no-quorum-policy="ignore" \
>     last-lrm-refresh="1391290945" \
>     maintenance-mode="false"
> rsc_defaults $id="rsc-options" \
>     resource-stickiness="100"
>
> Extract from cluster.log:
>
> Feb 01 21:40:15 corosync [MAIN ] Completed service synchronization, ready to provide service.
> Feb 01 21:40:15 corosync [TOTEM ] waiting_trans_ack changed to 0
> Feb 01 21:40:15 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
> Feb 01 21:40:15 [13253] ga1-ext cib: info: crm_cs_flush: Sent 4 CPG messages (0 remaining, last=48): OK (1)
> Feb 01 21:40:15 [13256] ga1-ext crmd: info: crm_cs_flush: Sent 3 CPG messages (0 remaining, last=24): OK (1)
> Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
> Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
> Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
> Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 0
> Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
> Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.3 -> 0.299.4 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='fail-count-drbd0']: No such device or address (rc=-6, origin=local/attrd/34, version=0.299.11)
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-mysql']: No such device or address (rc=-6, origin=local/attrd/35, version=0.299.11)
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-drbd0']: No such device or address (rc=-6, origin=local/attrd/36, version=0.299.11)
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.4 -> 0.299.5 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.5 -> 0.299.6 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.6 -> 0.299.7 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.7 -> 0.299.8 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.8 -> 0.299.9 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']: OK (rc=0, origin=local/attrd/37, version=0.299.11)
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/38, version=0.299.11)
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-ClusterIP']: No such device or address (rc=-6, origin=local/attrd/39, version=0.299.11)
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='probe_complete']: OK (rc=0, origin=local/attrd/40, version=0.299.11)
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/41, version=0.299.11)
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']: OK (rc=0, origin=local/attrd/42, version=0.299.11)
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/43, version=0.299.11)
> Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
> Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
> Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
> Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
> Feb 01 21:40:17 corosync [CMAN ] ais: deliver_fn source nodeid = 2, len=24, endian_conv=0
> Feb 01 21:40:17 corosync [CMAN ] memb: Message on port 0 is 6
> Feb 01 21:40:17 corosync [CMAN ] memb: got KILL for node 1
> Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
> Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
> Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
> Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=join_offer)
> Feb 01 21:40:17 [13256] ga1-ext crmd: info: do_state_transition: State transition S_INTEGRATION -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
> Feb 01 21:40:17 [13256] ga1-ext crmd: info: update_dc: Unset DC. Was ga1-ext
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.9 -> 0.299.10 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.10 -> 0.299.11 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
> Feb 01 21:40:18 [13247] ga1-ext pacemakerd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
> Feb 01 21:40:18 [13247] ga1-ext pacemakerd: error: mcp_cpg_destroy: Connection destroyed
> Feb 01 21:40:18 [13247] ga1-ext pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2
> Feb 01 21:40:18 [13255] ga1-ext attrd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
> Feb 01 21:40:18 [13255] ga1-ext attrd: crit: attrd_cs_destroy: Lost connection to Corosync service!
> Feb 01 21:40:18 [13255] ga1-ext attrd: notice: main: Exiting...
> Feb 01 21:40:18 [13255] ga1-ext attrd: notice: main: Disconnecting client 0x238ff10, pid=13256...
> Feb 01 21:40:18 [13255] ga1-ext attrd: error: attrd_cib_connection_destroy: Connection to the CIB terminated...
> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: stonith_shutdown: Terminating with 1 clients
> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: cib_connection_destroy: Connection to the CIB closed.
> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: crm_client_destroy: Destroying 0 events
> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: qb_ipcs_us_withdraw: withdrawing server sockets
> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: main: Done
> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: crm_xml_cleanup: Cleaning up memory from libxml2
> Feb 01 21:40:18 [13256] ga1-ext crmd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
> Feb 01 21:40:18 [13256] ga1-ext crmd: error: crmd_cs_destroy: connection terminated
> Feb 01 21:40:18 [13256] ga1-ext crmd: info: qb_ipcs_us_withdraw: withdrawing server sockets
> Feb 01 21:40:18 [13253] ga1-ext cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
> Feb 01 21:40:18 [13253] ga1-ext cib: error: cib_cs_destroy: Corosync connection lost! Exiting.
> Feb 01 21:40:18 [13253] ga1-ext cib: info: terminate_cib: cib_cs_destroy: Exiting fast...
> Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
> Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
> Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
> Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
> Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
> Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
> Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_xml_cleanup: Cleaning up memory from libxml2
> Feb 01 21:40:18 [13256] ga1-ext crmd: info: tengine_stonith_connection_destroy: Fencing daemon disconnected
> Feb 01 21:40:18 [13256] ga1-ext crmd: notice: crmd_exit: Forcing immediate exit: Link has been severed (67)
> Feb 01 21:40:18 [13256] ga1-ext crmd: info: crm_xml_cleanup: Cleaning up memory from libxml2
> Feb 01 21:40:18 [25258] ga1-ext lrmd: info: cancel_recurring_action: Cancelling operation ClusterIP_monitor_30000
> Feb 01 21:40:18 [25258] ga1-ext lrmd: warning: qb_ipcs_event_sendv: new_event_notification (25258-13256-6): Bad file descriptor (9)
> Feb 01 21:40:18 [25258] ga1-ext lrmd: warning: send_client_notify: Notification of client crmd/0b3ea733-7340-439c-9f46-81b0d7e1f6a1 failed
> Feb 01 21:40:18 [25258] ga1-ext lrmd: info: crm_client_destroy: Destroying 1 events
> Feb 01 21:40:18 [25260] ga1-ext pengine: info: crm_client_destroy: Destroying 0 events

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss
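A minimal sketch of the log check Honza asks for at the top of this thread, assuming the CMAN/corosync logs end up under /var/log/cluster/ on these CentOS 6 nodes (adjust the path to wherever cluster.log actually lives in your setup):

# search the cluster logs for the corosync scheduling warning Honza mentions
grep "Corosync main process was not scheduled" /var/log/cluster/*.log

If a match shows up with a pause of several seconds, that would line up with the symptom of the problem appearing only while the host is under heavy load (e.g. during the full backup).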