On 10/02/14 13:55, Jan Friesse wrote:
Alessandro Bono wrote:
On 10/02/14 12:24, Jan Friesse wrote:
Alessandro Bono wrote:
On 10/02/14 10:47, Jan Friesse wrote:
Alessandro,
can you find a message like "Corosync main process was not scheduled for
... ms" in the log file (corosync must be at least 1.4.1-16, so CentOS 6.5)?
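A minimal sketch of how one might search a log for that message, in Python. The function name is made up for illustration, and the number pattern assumes the pause length is printed as a plain decimal before "ms"; only the message text itself comes from the question above.

```python
import re

# Matches the message text quoted above; the numeric capture assumes the
# pause length is printed as a plain decimal number before "ms".
PAUSE_RE = re.compile(
    r"Corosync main process was not scheduled for ([0-9]+(?:\.[0-9]+)?) ?ms"
)


def find_scheduling_pauses(log_text):
    """Return the reported pause lengths (in ms) found in log_text."""
    # Collapse all whitespace runs so that a message wrapped across
    # several lines (as in pasted mail) still matches as one string.
    flat = " ".join(log_text.split())
    return [float(value) for value in PAUSE_RE.findall(flat)]
```

To use it, read whatever file corosync logs to on the node and pass its contents to find_scheduling_pauses(); an empty list means the message was not found.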
Hi,
there is no message like that in the log file.
The distro is CentOS 6.5.
OK. So the first thing to try is to remove the redundant ring (just remove
the altname tags) and see if the problem still exists. If so, give
standard multicast a try (so remove udpu), but make sure to enable
multicast_querier (echo 1 >
/sys/class/net/$NETWORK_IFACE/bridge/multicast_querier; I'm using a
libvirt qemu hook (/etc/libvirt/hooks/qemu) for that).
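For illustration, a hook of that kind could look roughly like the Python sketch below. This is not Honza's actual hook: the bridge name "br0", the choice to act on the "started" operation, and all function names are assumptions. It relies only on libvirt calling the qemu hook with the guest name, operation, sub-operation and an extra argument, and on the sysfs path quoted above.

```python
#!/usr/bin/env python3
import os
import sys

BRIDGE = "br0"  # assumption: name of the host bridge carrying cluster traffic
SYSFS_NET = "/sys/class/net"


def enable_multicast_querier(bridge, sysfs_net=SYSFS_NET):
    """Write 1 to the bridge's multicast_querier file and return its path."""
    path = os.path.join(sysfs_net, bridge, "bridge", "multicast_querier")
    with open(path, "w") as handle:
        handle.write("1\n")
    return path


def main(argv):
    # libvirt invokes the hook as:
    #   qemu <guest_name> <operation> <sub-operation> <extra>
    # Acting on "started" is a choice made for this sketch.
    if len(argv) >= 3 and argv[2] == "started":
        try:
            enable_multicast_querier(BRIDGE)
        except OSError as exc:
            # Report the failure but still exit 0, so the hook never
            # gets in the way of the guest itself.
            print("could not enable multicast_querier: %s" % exc,
                  file=sys.stderr)
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

Saved as /etc/libvirt/hooks/qemu and made executable, this would re-enable the querier each time a guest has started. The same effect can be had with the one-line echo shown above.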
The redundant ring and udpu were attempts to work around the problem;
this library error was first seen on a configuration without these
parameters.
I have the same problem on two clusters with similar node configurations.
OK. Can you then please paste logs from a single-ring multicast
configuration (ideally CentOS 6.5)?
I would have to find some old logs on my tape backup, which is not easy right now.
I have a full debug file, but it is from CentOS 6.4. It's 164k, so I'll send it to you
off-list.
Regards,
Honza
rpm -qa corosync
corosync-1.4.1-17.el6.x86_64
Regards,
Honza
Alessandro Bono wrote:
Hi,
after changing the cluster from corosync to cman+corosync (switching from
CentOS 6.3 to 6.4), I have a recurring problem with pacemaker/corosync.
Pacemaker reports this error
pacemakerd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
and shuts itself down.
This normally happens when the host machine is under high load, for
example during a full backup.
In addition, there are a lot of these messages:
Feb 01 23:27:04 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:04 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 23:27:06 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY
Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 23:27:09 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY
Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of ring now active
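To get a feel for how often each ring flaps, the [TOTEM] messages above can be tallied per ring. The following Python sketch is only a rough helper under stated assumptions: the function name is invented, and the patterns are taken from the message wording in this excerpt, not from any corosync documentation.

```python
import re
from collections import Counter

# Patterns copied from the wording of the log lines in this thread.
FAULTY_RE = re.compile(r"Marking ringid (\d+) interface (\S+) FAULTY")
RECOVERED_RE = re.compile(r"Automatically recovered ring (\d+)")


def summarize_ring_flaps(log_text):
    """Return (faulty, recovered) Counters keyed by ring id."""
    # Collapse whitespace so that entries wrapped over several lines
    # (interface address and FAULTY on their own lines) still match.
    flat = " ".join(log_text.split())
    faulty = Counter(int(ring) for ring, _iface in FAULTY_RE.findall(flat))
    recovered = Counter(int(ring) for ring in RECOVERED_RE.findall(flat))
    return faulty, recovered
```

Running this over a longer stretch of the log, and comparing the counts with the times of the full backup, may help show whether both rings fail together under load or only one of them.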
I reported this problem to the pacemaker mailing list, but they said it's a
corosync problem.
The same problem occurs with CentOS 6.5.
I tried switching communication to udpu and adding another communication
path, but without any luck.
The cluster nodes are KVM virtual machines.
Is it a configuration problem?
Some info is below; I can provide the full log if necessary.
rpm -qa | egrep "pacem|coro"| sort
corosync-1.4.1-17.el6.x86_64
corosynclib-1.4.1-17.el6.x86_64
drbd-pacemaker-8.3.16-1.el6.x86_64
pacemaker-1.1.10-14.el6_5.2.x86_64
pacemaker-cli-1.1.10-14.el6_5.2.x86_64
pacemaker-cluster-libs-1.1.10-14.el6_5.2.x86_64
pacemaker-debuginfo-1.1.10-1.el6.x86_64
pacemaker-libs-1.1.10-14.el6_5.2.x86_64
cat /etc/cluster/cluster.conf
<cluster config_version="8" name="ga-ext_cluster">
  <cman transport="udpu"/>
  <logging>
    <logging_daemon name="corosync" debug="on"/>
  </logging>
  <clusternodes>
    <clusternode name="ga1-ext" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="ga1-ext"/>
        </method>
      </fence>
      <altname name="ga1-ext_alt"/>
    </clusternode>
    <clusternode name="ga2-ext" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="ga2-ext"/>
        </method>
      </fence>
      <altname name="ga2-ext_alt"/>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_pcmk" name="pcmk"/>
  </fencedevices>
</cluster>
crm configure show
node ga1-ext \
        attributes standby="off"
node ga2-ext \
        attributes standby="off"
primitive ClusterIP ocf:heartbeat:IPaddr \
        params ip="10.12.23.3" cidr_netmask="24" \
        op monitor interval="30s"
primitive SharedFS ocf:heartbeat:Filesystem \
        params device="/dev/drbd/by-res/r0" directory="/shared" fstype="ext4" options="noatime,nobarrier"
primitive dovecot lsb:dovecot
primitive drbd0 ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="15s"
primitive drbdlinks ocf:tummy:drbdlinks
primitive mail ocf:heartbeat:MailTo \
        params email="root@xxxxxxxxxxxxxxxxxxxx" subject="ga-ext cluster - "
primitive mysql lsb:mysqld
group service_group SharedFS drbdlinks ClusterIP mail mysql dovecot \
        meta target-role="Started"
ms ms_drbd0 drbd0 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation service_on_drbd inf: service_group ms_drbd0:Master
order service_after_drbd inf: ms_drbd0:promote service_group:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.10-14.el6_5.2-368c726" \
        cluster-infrastructure="cman" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1391290945" \
        maintenance-mode="false"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"
extract from cluster.log
Feb 01 21:40:15 corosync [MAIN ] Completed service synchronization, ready to provide service.
Feb 01 21:40:15 corosync [TOTEM ] waiting_trans_ack changed to 0
Feb 01 21:40:15 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
Feb 01 21:40:15 [13253] ga1-ext cib: info: crm_cs_flush: Sent 4 CPG messages (0 remaining, last=48): OK (1)
Feb 01 21:40:15 [13256] ga1-ext crmd: info: crm_cs_flush: Sent 3 CPG messages (0 remaining, last=24): OK (1)
Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.3 -> 0.299.4 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='fail-count-drbd0']: No such device or address (rc=-6, origin=local/attrd/34, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-mysql']: No such device or address (rc=-6, origin=local/attrd/35, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-drbd0']: No such device or address (rc=-6, origin=local/attrd/36, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.4 -> 0.299.5 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.5 -> 0.299.6 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.6 -> 0.299.7 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.7 -> 0.299.8 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.8 -> 0.299.9 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']: OK (rc=0, origin=local/attrd/37, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/38, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-ClusterIP']: No such device or address (rc=-6, origin=local/attrd/39, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='probe_complete']: OK (rc=0, origin=local/attrd/40, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/41, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']: OK (rc=0, origin=local/attrd/42, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/43, version=0.299.11)
Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
Feb 01 21:40:17 corosync [CMAN ] ais: deliver_fn source nodeid = 2, len=24, endian_conv=0
Feb 01 21:40:17 corosync [CMAN ] memb: Message on port 0 is 6
Feb 01 21:40:17 corosync [CMAN ] memb: got KILL for node 1
Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=join_offer)
Feb 01 21:40:17 [13256] ga1-ext crmd: info: do_state_transition: State transition S_INTEGRATION -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
Feb 01 21:40:17 [13256] ga1-ext crmd: info: update_dc: Unset DC. Was ga1-ext
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.9 -> 0.299.10 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.10 -> 0.299.11 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:18 [13247] ga1-ext pacemakerd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Feb 01 21:40:18 [13247] ga1-ext pacemakerd: error: mcp_cpg_destroy: Connection destroyed
Feb 01 21:40:18 [13247] ga1-ext pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2
Feb 01 21:40:18 [13255] ga1-ext attrd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Feb 01 21:40:18 [13255] ga1-ext attrd: crit: attrd_cs_destroy: Lost connection to Corosync service!
Feb 01 21:40:18 [13255] ga1-ext attrd: notice: main: Exiting...
Feb 01 21:40:18 [13255] ga1-ext attrd: notice: main: Disconnecting client 0x238ff10, pid=13256...
Feb 01 21:40:18 [13255] ga1-ext attrd: error: attrd_cib_connection_destroy: Connection to the CIB terminated...
Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: stonith_shutdown: Terminating with 1 clients
Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: cib_connection_destroy: Connection to the CIB closed.
Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: crm_client_destroy: Destroying 0 events
Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: qb_ipcs_us_withdraw: withdrawing server sockets
Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: main: Done
Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: crm_xml_cleanup: Cleaning up memory from libxml2
Feb 01 21:40:18 [13256] ga1-ext crmd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Feb 01 21:40:18 [13256] ga1-ext crmd: error: crmd_cs_destroy: connection terminated
Feb 01 21:40:18 [13256] ga1-ext crmd: info: qb_ipcs_us_withdraw: withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Feb 01 21:40:18 [13253] ga1-ext cib: error: cib_cs_destroy: Corosync connection lost! Exiting.
Feb 01 21:40:18 [13253] ga1-ext cib: info: terminate_cib: cib_cs_destroy: Exiting fast...
Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_xml_cleanup: Cleaning up memory from libxml2
Feb 01 21:40:18 [13256] ga1-ext crmd: info: tengine_stonith_connection_destroy: Fencing daemon disconnected
Feb 01 21:40:18 [13256] ga1-ext crmd: notice: crmd_exit: Forcing immediate exit: Link has been severed (67)
Feb 01 21:40:18 [13256] ga1-ext crmd: info: crm_xml_cleanup: Cleaning up memory from libxml2
Feb 01 21:40:18 [25258] ga1-ext lrmd: info: cancel_recurring_action: Cancelling operation ClusterIP_monitor_30000
Feb 01 21:40:18 [25258] ga1-ext lrmd: warning: qb_ipcs_event_sendv: new_event_notification (25258-13256-6): Bad file descriptor (9)
Feb 01 21:40:18 [25258] ga1-ext lrmd: warning: send_client_notify: Notification of client crmd/0b3ea733-7340-439c-9f46-81b0d7e1f6a1 failed
Feb 01 21:40:18 [25258] ga1-ext lrmd: info: crm_client_destroy: Destroying 1 events
Feb 01 21:40:18 [25260] ga1-ext pengine: info: crm_client_destroy: Destroying 0 events
--
Best regards
Alessandro Bono
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss