Alessandro Bono wrote:
>
> On 10/02/14 12:24, Jan Friesse wrote:
>> Alessandro Bono wrote:
>>> On 10/02/14 10:47, Jan Friesse wrote:
>>>> Alessandro,
>>>> can you find a message like "Corosync main process was not scheduled
>>>> for ... ms" in the log file (corosync must be at least 1.4.1-16, so
>>>> CentOS 6.5)?
>>>
>>> Hi
>>>
>>> there is no message like that in the log file
>>> the distro is CentOS 6.5
>>
>> Ok. So the first thing to try is to remove the redundant ring (just
>> remove the altname tags) and see if the problem still exists. If so,
>> give standard multicast a try (so remove udpu), but make sure to
>> enable multicast_querier (echo 1 >
>> /sys/class/net/$NETWORK_IFACE/bridge/multicast_querier; I'm using a
>> libvirt qemu hook (/etc/libvirt/hooks/qemu) for that).
>
> The redundant ring and udpu are an attempt to work around the problem;
> this library error was first seen on a configuration without these
> parameters. I have the same problem on two clusters with a similar
> node configuration.

Ok. Can you please then paste logs from a single-ring multicast
configuration (ideally CentOS 6.5)?

Regards,
  Honza

>>
>> Regards,
>>   Honza
>>
>>> rpm -qa corosync
>>> corosync-1.4.1-17.el6.x86_64
>>>
>>>> Regards,
>>>>   Honza
>>>>
>>>> Alessandro Bono wrote:
>>>>> Hi
>>>>>
>>>>> After changing the cluster from corosync to cman+corosync (switching
>>>>> from CentOS 6.3 to 6.4) I have a recurring problem with
>>>>> pacemaker/corosync. pacemaker reports this error
>>>>>
>>>>> pacemakerd: error: pcmk_cpg_dispatch: Connection to the CPG API
>>>>> failed: Library error (2)
>>>>>
>>>>> and then shuts itself down. This normally happens when the host
>>>>> machine is under high load, for example during a full backup.
>>>>>
>>>>> In addition, there are a lot of these messages:
>>>>>
>>>>> Feb 01 23:27:04 corosync [TOTEM ] received message requesting test of ring now active
>>>>> Feb 01 23:27:04 corosync [TOTEM ] Automatically recovered ring 1
>>>>> Feb 01 23:27:06 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY
>>>>> Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of ring now active
>>>>> Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
>>>>> Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of ring now active
>>>>> Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
>>>>> Feb 01 23:27:09 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
>>>>> Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of ring now active
>>>>> Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 1
>>>>> Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of ring now active
>>>>> Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 0
>>>>> Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
>>>>> Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY
>>>>> Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of ring now active
>>>>> Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of ring now active
>>>>>
>>>>> I reported this problem to the pacemaker mailing list, but they said
>>>>> it is a corosync problem. Same problem with CentOS 6.5.
>>>>>
>>>>> I tried switching communication to udpu and adding another
>>>>> communication path, but without any luck. The cluster nodes are KVM
>>>>> virtual machines.
>>>>>
>>>>> Is it a configuration problem?
>>>>>
>>>>> Some info below; I can provide the full log if necessary.
>>>>>
>>>>> rpm -qa | egrep "pacem|coro" | sort
>>>>> corosync-1.4.1-17.el6.x86_64
>>>>> corosynclib-1.4.1-17.el6.x86_64
>>>>> drbd-pacemaker-8.3.16-1.el6.x86_64
>>>>> pacemaker-1.1.10-14.el6_5.2.x86_64
>>>>> pacemaker-cli-1.1.10-14.el6_5.2.x86_64
>>>>> pacemaker-cluster-libs-1.1.10-14.el6_5.2.x86_64
>>>>> pacemaker-debuginfo-1.1.10-1.el6.x86_64
>>>>> pacemaker-libs-1.1.10-14.el6_5.2.x86_64
>>>>>
>>>>> cat /etc/cluster/cluster.conf
>>>>> <cluster config_version="8" name="ga-ext_cluster">
>>>>>   <cman transport="udpu"/>
>>>>>   <logging>
>>>>>     <logging_daemon name="corosync" debug="on"/>
>>>>>   </logging>
>>>>>   <clusternodes>
>>>>>     <clusternode name="ga1-ext" nodeid="1">
>>>>>       <fence>
>>>>>         <method name="pcmk-redirect">
>>>>>           <device name="pcmk" port="ga1-ext"/>
>>>>>         </method>
>>>>>       </fence>
>>>>>       <altname name="ga1-ext_alt"/>
>>>>>     </clusternode>
>>>>>     <clusternode name="ga2-ext" nodeid="2">
>>>>>       <fence>
>>>>>         <method name="pcmk-redirect">
>>>>>           <device name="pcmk" port="ga2-ext"/>
>>>>>         </method>
>>>>>       </fence>
>>>>>       <altname name="ga2-ext_alt"/>
>>>>>     </clusternode>
>>>>>   </clusternodes>
>>>>>   <fencedevices>
>>>>>     <fencedevice agent="fence_pcmk" name="pcmk"/>
>>>>>   </fencedevices>
>>>>> </cluster>
>>>>>
>>>>> crm configure show
>>>>> node ga1-ext \
>>>>>         attributes standby="off"
>>>>> node ga2-ext \
>>>>>         attributes standby="off"
>>>>> primitive ClusterIP ocf:heartbeat:IPaddr \
>>>>>         params ip="10.12.23.3" cidr_netmask="24" \
>>>>>         op monitor interval="30s"
>>>>> primitive SharedFS ocf:heartbeat:Filesystem \
>>>>>         params device="/dev/drbd/by-res/r0" directory="/shared" fstype="ext4" options="noatime,nobarrier"
>>>>> primitive dovecot lsb:dovecot
>>>>> primitive drbd0 ocf:linbit:drbd \
>>>>>         params drbd_resource="r0" \
>>>>>         op monitor interval="15s"
>>>>> primitive drbdlinks ocf:tummy:drbdlinks
>>>>> primitive mail ocf:heartbeat:MailTo \
>>>>>         params email="root@xxxxxxxxxxxxxxxxxxxx" subject="ga-ext cluster - "
>>>>> primitive mysql lsb:mysqld
>>>>> group service_group SharedFS drbdlinks ClusterIP mail mysql dovecot \
>>>>>         meta target-role="Started"
>>>>> ms ms_drbd0 drbd0 \
>>>>>         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
>>>>> colocation service_on_drbd inf: service_group ms_drbd0:Master
>>>>> order service_after_drbd inf: ms_drbd0:promote service_group:start
>>>>> property $id="cib-bootstrap-options" \
>>>>>         dc-version="1.1.10-14.el6_5.2-368c726" \
>>>>>         cluster-infrastructure="cman" \
>>>>>         expected-quorum-votes="2" \
>>>>>         stonith-enabled="false" \
>>>>>         no-quorum-policy="ignore" \
>>>>>         last-lrm-refresh="1391290945" \
>>>>>         maintenance-mode="false"
>>>>> rsc_defaults $id="rsc-options" \
>>>>>         resource-stickiness="100"
>>>>>
>>>>> An extract from cluster.log:
>>>>>
>>>>> Feb 01 21:40:15 corosync [MAIN ] Completed service synchronization, ready to provide service.
>>>>> Feb 01 21:40:15 corosync [TOTEM ] waiting_trans_ack changed to 0
>>>>> Feb 01 21:40:15 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
>>>>> Feb 01 21:40:15 [13253] ga1-ext cib: info: crm_cs_flush: Sent 4 CPG messages (0 remaining, last=48): OK (1)
>>>>> Feb 01 21:40:15 [13256] ga1-ext crmd: info: crm_cs_flush: Sent 3 CPG messages (0 remaining, last=24): OK (1)
>>>>> Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
>>>>> Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
>>>>> Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
>>>>> Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 0
>>>>> Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
>>>>> Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.3 -> 0.299.4 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='fail-count-drbd0']: No such device or address (rc=-6, origin=local/attrd/34, version=0.299.11)
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-mysql']: No such device or address (rc=-6, origin=local/attrd/35, version=0.299.11)
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-drbd0']: No such device or address (rc=-6, origin=local/attrd/36, version=0.299.11)
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.4 -> 0.299.5 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.5 -> 0.299.6 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.6 -> 0.299.7 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.7 -> 0.299.8 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.8 -> 0.299.9 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']: OK (rc=0, origin=local/attrd/37, version=0.299.11)
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/38, version=0.299.11)
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-ClusterIP']: No such device or address (rc=-6, origin=local/attrd/39, version=0.299.11)
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='probe_complete']: OK (rc=0, origin=local/attrd/40, version=0.299.11)
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/41, version=0.299.11)
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']: OK (rc=0, origin=local/attrd/42, version=0.299.11)
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/43, version=0.299.11)
>>>>> Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
>>>>> Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
>>>>> Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
>>>>> Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
>>>>> Feb 01 21:40:17 corosync [CMAN ] ais: deliver_fn source nodeid = 2, len=24, endian_conv=0
>>>>> Feb 01 21:40:17 corosync [CMAN ] memb: Message on port 0 is 6
>>>>> Feb 01 21:40:17 corosync [CMAN ] memb: got KILL for node 1
>>>>> Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
>>>>> Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
>>>>> Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
>>>>> Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=join_offer)
>>>>> Feb 01 21:40:17 [13256] ga1-ext crmd: info: do_state_transition: State transition S_INTEGRATION -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
>>>>> Feb 01 21:40:17 [13256] ga1-ext crmd: info: update_dc: Unset DC. Was ga1-ext
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.9 -> 0.299.10 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
>>>>> Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.10 -> 0.299.11 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
>>>>> Feb 01 21:40:18 [13247] ga1-ext pacemakerd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
>>>>> Feb 01 21:40:18 [13247] ga1-ext pacemakerd: error: mcp_cpg_destroy: Connection destroyed
>>>>> Feb 01 21:40:18 [13247] ga1-ext pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2
>>>>> Feb 01 21:40:18 [13255] ga1-ext attrd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
>>>>> Feb 01 21:40:18 [13255] ga1-ext attrd: crit: attrd_cs_destroy: Lost connection to Corosync service!
>>>>> Feb 01 21:40:18 [13255] ga1-ext attrd: notice: main: Exiting...
>>>>> Feb 01 21:40:18 [13255] ga1-ext attrd: notice: main: Disconnecting client 0x238ff10, pid=13256...
>>>>> Feb 01 21:40:18 [13255] ga1-ext attrd: error: attrd_cib_connection_destroy: Connection to the CIB terminated...
>>>>> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: stonith_shutdown: Terminating with 1 clients
>>>>> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: cib_connection_destroy: Connection to the CIB closed.
>>>>> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: crm_client_destroy: Destroying 0 events
>>>>> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: qb_ipcs_us_withdraw: withdrawing server sockets
>>>>> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: main: Done
>>>>> Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: crm_xml_cleanup: Cleaning up memory from libxml2
>>>>> Feb 01 21:40:18 [13256] ga1-ext crmd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
>>>>> Feb 01 21:40:18 [13256] ga1-ext crmd: error: crmd_cs_destroy: connection terminated
>>>>> Feb 01 21:40:18 [13256] ga1-ext crmd: info: qb_ipcs_us_withdraw: withdrawing server sockets
>>>>> Feb 01 21:40:18 [13253] ga1-ext cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
>>>>> Feb 01 21:40:18 [13253] ga1-ext cib: error: cib_cs_destroy: Corosync connection lost! Exiting.
>>>>> Feb 01 21:40:18 [13253] ga1-ext cib: info: terminate_cib: cib_cs_destroy: Exiting fast...
>>>>> Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
>>>>> Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
>>>>> Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
>>>>> Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
>>>>> Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
>>>>> Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
>>>>> Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_xml_cleanup: Cleaning up memory from libxml2
>>>>> Feb 01 21:40:18 [13256] ga1-ext crmd: info: tengine_stonith_connection_destroy: Fencing daemon disconnected
>>>>> Feb 01 21:40:18 [13256] ga1-ext crmd: notice: crmd_exit: Forcing immediate exit: Link has been severed (67)
>>>>> Feb 01 21:40:18 [13256] ga1-ext crmd: info: crm_xml_cleanup: Cleaning up memory from libxml2
>>>>> Feb 01 21:40:18 [25258] ga1-ext lrmd: info: cancel_recurring_action: Cancelling operation ClusterIP_monitor_30000
>>>>> Feb 01 21:40:18 [25258] ga1-ext lrmd: warning: qb_ipcs_event_sendv: new_event_notification (25258-13256-6): Bad file descriptor (9)
>>>>> Feb 01 21:40:18 [25258] ga1-ext lrmd: warning: send_client_notify: Notification of client crmd/0b3ea733-7340-439c-9f46-81b0d7e1f6a1 failed
>>>>> Feb 01 21:40:18 [25258] ga1-ext lrmd: info: crm_client_destroy: Destroying 1 events
>>>>> Feb 01 21:40:18 [25260] ga1-ext pengine: info: crm_client_destroy: Destroying 0 events
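
For anyone wanting to try the multicast_querier workaround Honza
describes above, here is a minimal sketch of such a libvirt qemu hook,
assuming the cluster guests are attached to a Linux bridge named br0
(the bridge name is an assumption; adjust it to your setup). libvirt
runs /etc/libvirt/hooks/qemu, if it exists and is executable, passing
the guest name and the operation as the first two arguments:

#!/bin/sh
# /etc/libvirt/hooks/qemu -- invoked by libvirtd as:
#   qemu <guest_name> <operation> <sub-operation> ...
# Sketch only: br0 is an assumed bridge name.
GUEST="$1"       # guest name, unused here but handy for per-guest logic
OPERATION="$2"

if [ "$OPERATION" = "started" ]; then
    # Make the bridge act as an IGMP querier so that multicast group
    # membership is periodically refreshed and corosync multicast
    # traffic keeps flowing between the guests.
    echo 1 > /sys/class/net/br0/bridge/multicast_querier
fi

# Exit 0 unconditionally so a hook problem never blocks guest operations.
exit 0

The hook must be executable (chmod +x /etc/libvirt/hooks/qemu), and
libvirtd has to be restarted before it notices a newly created hook
script. The sysfs setting itself does not survive a host reboot, which
is why driving it from a hook is convenient; you can check the current
value with cat /sys/class/net/br0/bridge/multicast_querier.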