Alessandro Bono wrote:
> On 10/02/14 10:47, Jan Friesse wrote:
>> Alessandro,
>> can you find a message like "Corosync main process was not scheduled for
>> ... ms" in the log file (corosync must be at least 1.4.1-16, so CentOS 6.5)?
> Hi
>
> there is no message like that in the log file
> the distro is CentOS 6.5

Ok. So the first thing to try is to remove the redundant ring (just
remove the altname tags) and see if the problem still exists. If so,
give standard multicast a try (so remove udpu), but make sure to enable
multicast_querier (echo 1 >
/sys/class/net/$NETWORK_IFACE/bridge/multicast_querier; I'm using a
libvirt qemu hook (/etc/libvirt/hooks/qemu) for that). See the sketches
at the end of this message.

Regards,
  Honza

> rpm -qa corosync
> corosync-1.4.1-17.el6.x86_64
>
>> Regards,
>>   Honza
>>
>> Alessandro Bono wrote:
>>> Hi
>>>
>>> after changing the cluster from corosync to cman+corosync (switching
>>> from CentOS 6.3 to 6.4) I have a recurring problem with
>>> pacemaker/corosync. pacemaker reports this error
>>>
>>> pacemakerd: error: pcmk_cpg_dispatch: Connection to the CPG API
>>> failed: Library error (2)
>>>
>>> and shuts itself down.
>>> This normally happens when the host machine is under high load, for
>>> example during a full backup.
>>>
>>> In addition, there are a lot of these messages:
>>>
>>> Feb 01 23:27:04 corosync [TOTEM ] received message requesting test of ring now active
>>> Feb 01 23:27:04 corosync [TOTEM ] Automatically recovered ring 1
>>> Feb 01 23:27:06 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY
>>> Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of ring now active
>>> Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
>>> Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of ring now active
>>> Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
>>> Feb 01 23:27:09 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
>>> Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of ring now active
>>> Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 1
>>> Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of ring now active
>>> Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 0
>>> Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
>>> Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY
>>> Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of ring now active
>>> Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of ring now active
>>>
>>> I reported this problem to the pacemaker ML, but they said it's a
>>> corosync problem. Same problem with CentOS 6.5.
>>>
>>> I tried switching communication to udpu and adding another
>>> communication path, but without any luck.
>>> The cluster nodes are KVM virtual machines.
>>>
>>> Is it a configuration problem?
>>>
>>> some info below, I can provide full log if necessary
>>>
>>> rpm -qa | egrep "pacem|coro" | sort
>>> corosync-1.4.1-17.el6.x86_64
>>> corosynclib-1.4.1-17.el6.x86_64
>>> drbd-pacemaker-8.3.16-1.el6.x86_64
>>> pacemaker-1.1.10-14.el6_5.2.x86_64
>>> pacemaker-cli-1.1.10-14.el6_5.2.x86_64
>>> pacemaker-cluster-libs-1.1.10-14.el6_5.2.x86_64
>>> pacemaker-debuginfo-1.1.10-1.el6.x86_64
>>> pacemaker-libs-1.1.10-14.el6_5.2.x86_64
>>>
>>> cat /etc/cluster/cluster.conf
>>> <cluster config_version="8" name="ga-ext_cluster">
>>>   <cman transport="udpu"/>
>>>   <logging>
>>>     <logging_daemon name="corosync" debug="on"/>
>>>   </logging>
>>>   <clusternodes>
>>>     <clusternode name="ga1-ext" nodeid="1">
>>>       <fence>
>>>         <method name="pcmk-redirect">
>>>           <device name="pcmk" port="ga1-ext"/>
>>>         </method>
>>>       </fence>
>>>       <altname name="ga1-ext_alt"/>
>>>     </clusternode>
>>>     <clusternode name="ga2-ext" nodeid="2">
>>>       <fence>
>>>         <method name="pcmk-redirect">
>>>           <device name="pcmk" port="ga2-ext"/>
>>>         </method>
>>>       </fence>
>>>       <altname name="ga2-ext_alt"/>
>>>     </clusternode>
>>>   </clusternodes>
>>>   <fencedevices>
>>>     <fencedevice agent="fence_pcmk" name="pcmk"/>
>>>   </fencedevices>
>>> </cluster>
>>>
>>> crm configure show
>>> node ga1-ext \
>>>     attributes standby="off"
>>> node ga2-ext \
>>>     attributes standby="off"
>>> primitive ClusterIP ocf:heartbeat:IPaddr \
>>>     params ip="10.12.23.3" cidr_netmask="24" \
>>>     op monitor interval="30s"
>>> primitive SharedFS ocf:heartbeat:Filesystem \
>>>     params device="/dev/drbd/by-res/r0" directory="/shared" fstype="ext4" options="noatime,nobarrier"
>>> primitive dovecot lsb:dovecot
>>> primitive drbd0 ocf:linbit:drbd \
>>>     params drbd_resource="r0" \
>>>     op monitor interval="15s"
>>> primitive drbdlinks ocf:tummy:drbdlinks
>>> primitive mail ocf:heartbeat:MailTo \
>>>     params email="root@xxxxxxxxxxxxxxxxxxxx" subject="ga-ext cluster - "
>>> primitive mysql lsb:mysqld
>>> group service_group SharedFS drbdlinks ClusterIP mail mysql dovecot \
>>>     meta target-role="Started"
>>> ms ms_drbd0 drbd0 \
>>>     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
>>> colocation service_on_drbd inf: service_group ms_drbd0:Master
>>> order service_after_drbd inf: ms_drbd0:promote service_group:start
>>> property $id="cib-bootstrap-options" \
>>>     dc-version="1.1.10-14.el6_5.2-368c726" \
>>>     cluster-infrastructure="cman" \
>>>     expected-quorum-votes="2" \
>>>     stonith-enabled="false" \
>>>     no-quorum-policy="ignore" \
>>>     last-lrm-refresh="1391290945" \
>>>     maintenance-mode="false"
>>> rsc_defaults $id="rsc-options" \
>>>     resource-stickiness="100"
>>>
>>> extract from cluster.log
>>>
>>> Feb 01 21:40:15 corosync [MAIN ] Completed service synchronization, ready to provide service.
>>> Feb 01 21:40:15 corosync [TOTEM ] waiting_trans_ack changed to 0
>>> Feb 01 21:40:15 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
>>> Feb 01 21:40:15 [13253] ga1-ext cib: info: crm_cs_flush: Sent 4 CPG messages (0 remaining, last=48): OK (1)
>>> Feb 01 21:40:15 [13256] ga1-ext crmd: info: crm_cs_flush: Sent 3 CPG messages (0 remaining, last=24): OK (1)
>>> Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
>>> Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
>>> Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
>>> Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 0
>>> Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
>>> Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.3 -> 0.299.4 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='fail-count-drbd0']: No such device or address (rc=-6, origin=local/attrd/34, version=0.299.11)
>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-mysql']: No such device or address (rc=-6, origin=local/attrd/35, version=0.299.11)
>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-drbd0']: No such device or address (rc=-6, origin=local/attrd/36, version=0.299.11)
>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.4 -> 0.299.5 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.5 -> 0.299.6 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.6 -> 0.299.7 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.7 -> 0.299.8 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.8 -> 0.299.9 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']: OK (rc=0, origin=local/attrd/37, version=0.299.11)
>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/38, version=0.299.11)
>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-ClusterIP']: No such device or address (rc=-6, origin=local/attrd/39, version=0.299.11)
>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='probe_complete']: OK (rc=0, origin=local/attrd/40, version=0.299.11)
>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/41, version=0.299.11)
>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']: OK (rc=0, origin=local/attrd/42, version=0.299.11)
>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/43, version=0.299.11)
>>> Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
>>> Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
>>> Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
>>> Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
>>> Feb 01 21:40:17 corosync [CMAN ] ais: deliver_fn source nodeid = 2, len=24, endian_conv=0
>>> Feb 01 21:40:17 corosync [CMAN ] memb: Message on port 0 is 6
>>> Feb 01 21:40:17 corosync [CMAN ] memb: got KILL for node 1
>>> Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
>>> Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
>>> Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
>>> Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=join_offer)
>>> Feb 01 21:40:17 [13256] ga1-ext crmd: info: do_state_transition: State transition S_INTEGRATION -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
>>> Feb 01 21:40:17 [13256] ga1-ext crmd: info: update_dc: Unset DC. Was ga1-ext
>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.9 -> 0.299.10 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.10 -> 0.299.11 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
>>> Feb 01 21:40:18 [13247] ga1-ext pacemakerd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
>>> Feb 01 21:40:18 [13247] ga1-ext pacemakerd: error: mcp_cpg_destroy: Connection destroyed
>>> Feb 01 21:40:18 [13247] ga1-ext pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2
>>> Feb 01 21:40:18 [13255] ga1-ext attrd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
>>> Feb 01 21:40:18 [13255] ga1-ext attrd: crit: attrd_cs_destroy: Lost connection to Corosync service!
>>> Feb 01 21:40:18 [13255] ga1-ext attrd: notice: main: Exiting...
>>> Feb 01 21:40:18 [13255] ga1-ext attrd: notice: main: Disconnecting client 0x238ff10, pid=13256...
>>> Feb 01 21:40:18 [13255] ga1-ext attrd: error: attrd_cib_connection_destroy: Connection to the CIB terminated...
>>> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: stonith_shutdown: Terminating with 1 clients
>>> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: cib_connection_destroy: Connection to the CIB closed.
>>> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: crm_client_destroy: Destroying 0 events
>>> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: qb_ipcs_us_withdraw: withdrawing server sockets
>>> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: main: Done
>>> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: crm_xml_cleanup: Cleaning up memory from libxml2
>>> Feb 01 21:40:18 [13256] ga1-ext crmd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
>>> Feb 01 21:40:18 [13256] ga1-ext crmd: error: crmd_cs_destroy: connection terminated
>>> Feb 01 21:40:18 [13256] ga1-ext crmd: info: qb_ipcs_us_withdraw: withdrawing server sockets
>>> Feb 01 21:40:18 [13253] ga1-ext cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
>>> Feb 01 21:40:18 [13253] ga1-ext cib: error: cib_cs_destroy: Corosync connection lost! Exiting.
>>> Feb 01 21:40:18 [13253] ga1-ext cib: info: terminate_cib: cib_cs_destroy: Exiting fast...
>>> Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
>>> Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
>>> Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
>>> Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
>>> Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
>>> Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
>>> Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_xml_cleanup: Cleaning up memory from libxml2
>>> Feb 01 21:40:18 [13256] ga1-ext crmd: info: tengine_stonith_connection_destroy: Fencing daemon disconnected
>>> Feb 01 21:40:18 [13256] ga1-ext crmd: notice: crmd_exit: Forcing immediate exit: Link has been severed (67)
>>> Feb 01 21:40:18 [13256] ga1-ext crmd: info: crm_xml_cleanup: Cleaning up memory from libxml2
>>> Feb 01 21:40:18 [25258] ga1-ext lrmd: info: cancel_recurring_action: Cancelling operation ClusterIP_monitor_30000
>>> Feb 01 21:40:18 [25258] ga1-ext lrmd: warning: qb_ipcs_event_sendv: new_event_notification (25258-13256-6): Bad file descriptor (9)
>>> Feb 01 21:40:18 [25258] ga1-ext lrmd: warning: send_client_notify: Notification of client crmd/0b3ea733-7340-439c-9f46-81b0d7e1f6a1 failed
>>> Feb 01 21:40:18 [25258] ga1-ext lrmd: info: crm_client_destroy: Destroying 1 events
>>> Feb 01 21:40:18 [25260] ga1-ext pengine: info: crm_client_destroy: Destroying 0 events
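
Concretely, on the cluster.conf quoted above, removing the redundant
ring means deleting the <altname> lines (which drops ring 1), and
falling back to standard multicast means deleting the transport="udpu"
attribute, since cman defaults to UDP multicast when no transport is
given. A minimal sketch of the result, assuming nothing else changes
(remember to bump config_version; if the first step is being tested on
its own, keep transport="udpu" and only drop the altname lines):

  <cluster config_version="9" name="ga-ext_cluster">
    <cman/>
    <logging>
      <logging_daemon name="corosync" debug="on"/>
    </logging>
    <clusternodes>
      <clusternode name="ga1-ext" nodeid="1">
        <fence>
          <method name="pcmk-redirect">
            <device name="pcmk" port="ga1-ext"/>
          </method>
        </fence>
      </clusternode>
      <clusternode name="ga2-ext" nodeid="2">
        <fence>
          <method name="pcmk-redirect">
            <device name="pcmk" port="ga2-ext"/>
          </method>
        </fence>
      </clusternode>
    </clusternodes>
    <fencedevices>
      <fencedevice agent="fence_pcmk" name="pcmk"/>
    </fencedevices>
  </cluster>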
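
While testing, the state of the two rings can be checked on each node
with corosync-cfgtool, and a FAULTY ring can be re-enabled by hand
instead of waiting for the automatic recovery seen in the logs:

  corosync-cfgtool -s    # print the status of ring 0 and ring 1 on this node
  corosync-cfgtool -r    # reset redundant ring state cluster wide after a fault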
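
As for the qemu hook: a minimal sketch of the sort of script meant
above, assuming the cluster VMs sit on a host bridge named br0 (the
bridge name, and reacting to the "started" operation, are assumptions;
libvirtd invokes /etc/libvirt/hooks/qemu with the guest name and the
operation as its first two arguments):

  #!/bin/sh
  # /etc/libvirt/hooks/qemu -- invoked by libvirtd as: qemu <guest> <operation> <sub-op> <extra>
  # Keep the IGMP querier enabled on the host bridge so that bridge
  # multicast snooping keeps forwarding corosync multicast traffic.

  BRIDGE="br0"    # assumption: the bridge the cluster VMs are attached to

  if [ "$2" = "started" ]; then
      echo 1 > "/sys/class/net/$BRIDGE/bridge/multicast_querier"
  fi

multicast_querier is a property of the bridge, not of any one guest, so
running this on every guest start is simply a convenient way to keep the
setting applied; the hook file must be executable for libvirtd to run it.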