On 10/02/14 15:55, Jan Friesse wrote:
Alessandro Bono wrote:
On 10/02/14 13:55, Jan Friesse wrote:
Alessandro Bono wrote:
On 10/02/14 12:24, Jan Friesse wrote:
Alessandro Bono wrote:
On 10/02/14 10:47, Jan Friesse wrote:
Alessandro,
can you find a message like "Corosync main process was not scheduled
for ... ms" in the log file (corosync must be at least 1.4.1-16, so CentOS
6.5)?
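A quick way to check for that message; this is only a sketch, and /var/log/cluster/corosync.log is the usual CentOS 6 cluster log location, which may differ on other setups:

```shell
# Look for the scheduler-starvation warning Jan refers to.
# The path is an assumption (typical CentOS 6 default).
LOG="${1:-/var/log/cluster/corosync.log}"
grep "was not scheduled for" "$LOG" 2>/dev/null \
    || echo "no scheduler-starvation warnings found in $LOG"
```

If the warning appears, it confirms the host was starving the VM of CPU time during the backup window.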
Hi
there is no message like that in the log file
distro is centos 6.5
ok. So the first thing to try is to remove the redundant ring (just remove
the altname tags) and see if the problem still exists. If so, give
standard multicast a try (so remove udpu) but make sure to enable
multicast_querier (echo 1 >
/sys/class/net/$NETWORK_IFACE/bridge/multicast_querier; I'm using a
libvirt qemu hook (/etc/libvirt/hooks/qemu) for that).
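A minimal sketch of such a hook script, under assumptions: "br0" is a placeholder for the actual bridge interface, and the function name is invented for illustration. libvirt invokes hooks as `qemu <guest> <operation> ...`:

```shell
#!/bin/sh
# Hypothetical /etc/libvirt/hooks/qemu sketch: enable the bridge's
# multicast querier whenever a guest starts. "br0" is a placeholder.

enable_querier() {
    bridge="$1"
    querier="/sys/class/net/$bridge/bridge/multicast_querier"
    if [ -w "$querier" ]; then
        echo 1 > "$querier"
        echo "multicast_querier enabled on $bridge"
    else
        echo "no writable multicast_querier for $bridge (not a bridge, or not root?)"
    fi
}

# $2 is the libvirt hook operation; act only when a guest has started.
if [ "$2" = "started" ]; then
    enable_querier br0
fi
```

Without a querier on the bridge, IGMP snooping can silently drop the multicast traffic corosync depends on.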
the redundant ring and udpu are an attempt to work around the problem;
this library error was first seen on a configuration without these
parameters
I have the same problem on two clusters with similar node configurations
Ok. Can you then please paste logs from a single-ring multicast
configuration (ideally CentOS 6.5)?
I would have to find some old logs on my tape backup, which is not easy right now
I have a full debug file, but it's for CentOS 6.4 and it's 164k; I'll send it
to you off-list
Ok, but I would still really like to see a log from 6.5 (there was a huge
amount of fixes for 6.5).
I don't have a log with CentOS 6.5; I reconfigured the cluster as you requested
<cluster config_version="8" name="ga-ext_cluster">
<logging>
<logging_daemon name="corosync" debug="on"/>
</logging>
<clusternodes>
<clusternode name="ga1-ext" nodeid="1">
<fence>
<method name="pcmk-redirect">
<device name="pcmk" port="ga1-ext"/>
</method>
</fence>
</clusternode>
<clusternode name="ga2-ext" nodeid="2">
<fence>
<method name="pcmk-redirect">
<device name="pcmk" port="ga2-ext"/>
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice agent="fence_pcmk" name="pcmk"/>
</fencedevices>
</cluster>
tonight I'll force a full backup to trigger the corosync error and I'll send
you the log file
But from the log I've seen no bigger problem. I mean, there was one split,
probably caused by the fact that the VM was not scheduled (22:26:25). Other
than that, the log looks quite ok. There is really no universal solution for
the VM-not-scheduled scenario. You can lower the priority of the backup
script / raise the priority of the VM, pin the VM to a CPU, ... but it may or
may not help.
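One way to try the lower-the-backup-priority idea; a sketch only, and the echo stands in for whatever the real backup job is:

```shell
# Possible mitigation (may or may not help, as noted above): run the
# backup at the lowest CPU and IO priority so the corosync VM is less
# likely to be starved. The echo is a placeholder for the real backup.
BACKUP='echo backup-placeholder'

if command -v ionice >/dev/null 2>&1; then
    nice -n 19 ionice -c 3 sh -c "$BACKUP"   # -c 3 = idle IO class
else
    nice -n 19 sh -c "$BACKUP"               # CPU priority only
fi
```

CPU pinning for the VM itself would instead be done on the KVM host (e.g. with libvirt's vcpupin), which is outside this sketch.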
Ok, but on the same machines there is another cluster with an old Ubuntu VM
running corosync 1.4.2 and it's rock solid
also, prior to switching to cman the cluster worked perfectly
as a workaround maybe I'll put the cluster in maintenance mode and stop
pacemaker before starting the backup, but it's not a nice thing
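That workaround could look roughly like the sketch below; it assumes crmsh ("crm") is installed, so the call is guarded and degrades to a message on machines without it, and the helper name is invented for illustration:

```shell
# Sketch of the maintenance-mode workaround: toggle Pacemaker's
# maintenance-mode property around the backup window.
set_maintenance() {
    state="$1"   # "true" or "false"
    if command -v crm >/dev/null 2>&1; then
        crm configure property maintenance-mode="$state"
    else
        echo "would set maintenance-mode=$state (crm not available here)"
    fi
}

set_maintenance true
# ... run the full backup here ...
set_maintenance false
```

In maintenance mode Pacemaker stops managing resources, so a transient corosync membership flap during the backup should not trigger recovery actions; it does not, however, prevent the CPG disconnect itself.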
Honza
Regards,
Honza
Regards,
Honza
rpm -qa corosync
corosync-1.4.1-17.el6.x86_64
Regards,
Honza
Alessandro Bono wrote:
Hi
after changing the cluster from corosync to cman+corosync (switching from
CentOS 6.3 to 6.4) I have a recurring problem with pacemaker/corosync
pacemaker reports this error
pacemakerd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
and shuts itself down
This normally happens when the host machine is under high load, for
example during a full backup
in addition, there are a lot of these messages
Feb 01 23:27:04 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:04 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 23:27:06 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY
Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 23:27:07 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:07 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 23:27:09 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 23:27:10 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:10 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
Feb 01 23:27:12 corosync [TOTEM ] Marking ringid 0 interface 10.12.32.1 FAULTY
Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 23:27:13 corosync [TOTEM ] received message requesting test of ring now active
I reported this problem to the pacemaker ML but they said it's a corosync
problem
same problem with CentOS 6.5
I tried switching communication to udpu and adding another communication
path, but without any luck
cluster nodes are KVM virtual machines
Is it a configuration problem?
some info below, I can provide full logs if necessary
rpm -qa | egrep "pacem|coro"| sort
corosync-1.4.1-17.el6.x86_64
corosynclib-1.4.1-17.el6.x86_64
drbd-pacemaker-8.3.16-1.el6.x86_64
pacemaker-1.1.10-14.el6_5.2.x86_64
pacemaker-cli-1.1.10-14.el6_5.2.x86_64
pacemaker-cluster-libs-1.1.10-14.el6_5.2.x86_64
pacemaker-debuginfo-1.1.10-1.el6.x86_64
pacemaker-libs-1.1.10-14.el6_5.2.x86_64
cat /etc/cluster/cluster.conf
<cluster config_version="8" name="ga-ext_cluster">
<cman transport="udpu"/>
<logging>
<logging_daemon name="corosync" debug="on"/>
</logging>
<clusternodes>
<clusternode name="ga1-ext" nodeid="1">
<fence>
<method name="pcmk-redirect">
<device name="pcmk" port="ga1-ext"/>
</method>
</fence>
<altname name="ga1-ext_alt"/>
</clusternode>
<clusternode name="ga2-ext" nodeid="2">
<fence>
<method name="pcmk-redirect">
<device name="pcmk" port="ga2-ext"/>
</method>
</fence>
<altname name="ga2-ext_alt"/>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice agent="fence_pcmk" name="pcmk"/>
</fencedevices>
</cluster>
crm configure show
node ga1-ext \
attributes standby="off"
node ga2-ext \
attributes standby="off"
primitive ClusterIP ocf:heartbeat:IPaddr \
params ip="10.12.23.3" cidr_netmask="24" \
op monitor interval="30s"
primitive SharedFS ocf:heartbeat:Filesystem \
params device="/dev/drbd/by-res/r0" directory="/shared" fstype="ext4" options="noatime,nobarrier"
primitive dovecot lsb:dovecot
primitive drbd0 ocf:linbit:drbd \
params drbd_resource="r0" \
op monitor interval="15s"
primitive drbdlinks ocf:tummy:drbdlinks
primitive mail ocf:heartbeat:MailTo \
params email="root@xxxxxxxxxxxxxxxxxxxx" subject="ga-ext cluster - "
primitive mysql lsb:mysqld
group service_group SharedFS drbdlinks ClusterIP mail mysql dovecot \
meta target-role="Started"
ms ms_drbd0 drbd0 \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation service_on_drbd inf: service_group ms_drbd0:Master
order service_after_drbd inf: ms_drbd0:promote service_group:start
property $id="cib-bootstrap-options" \
dc-version="1.1.10-14.el6_5.2-368c726" \
cluster-infrastructure="cman" \
expected-quorum-votes="2" \
stonith-enabled="false" \
no-quorum-policy="ignore" \
last-lrm-refresh="1391290945" \
maintenance-mode="false"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"
extract from cluster.log
Feb 01 21:40:15 corosync [MAIN ] Completed service synchronization, ready to provide service.
Feb 01 21:40:15 corosync [TOTEM ] waiting_trans_ack changed to 0
Feb 01 21:40:15 corosync [TOTEM ] Marking ringid 1 interface 10.12.23.1 FAULTY
Feb 01 21:40:15 [13253] ga1-ext cib: info: crm_cs_flush: Sent 4 CPG messages (0 remaining, last=48): OK (1)
Feb 01 21:40:15 [13256] ga1-ext crmd: info: crm_cs_flush: Sent 3 CPG messages (0 remaining, last=24): OK (1)
Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 21:40:16 corosync [TOTEM ] received message requesting test of ring now active
Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 0
Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 21:40:16 corosync [TOTEM ] Automatically recovered ring 1
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.3 -> 0.299.4 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='fail-count-drbd0']: No such device or address (rc=-6, origin=local/attrd/34, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-mysql']: No such device or address (rc=-6, origin=local/attrd/35, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-drbd0']: No such device or address (rc=-6, origin=local/attrd/36, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.4 -> 0.299.5 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.5 -> 0.299.6 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.6 -> 0.299.7 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.7 -> 0.299.8 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.8 -> 0.299.9 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']: OK (rc=0, origin=local/attrd/37, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/38, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='last-failure-ClusterIP']: No such device or address (rc=-6, origin=local/attrd/39, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='probe_complete']: OK (rc=0, origin=local/attrd/40, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/41, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_query operation for section //cib/status//node_state[@id='ga1-ext']//transient_attributes//nvpair[@name='master-drbd0']: OK (rc=0, origin=local/attrd/42, version=0.299.11)
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/43, version=0.299.11)
Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
Feb 01 21:40:17 corosync [CMAN ] ais: deliver_fn source nodeid = 2, len=24, endian_conv=0
Feb 01 21:40:17 corosync [CMAN ] memb: Message on port 0 is 6
Feb 01 21:40:17 corosync [CMAN ] memb: got KILL for node 1
Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=noop)
Feb 01 21:40:17 [13256] ga1-ext crmd: info: register_fsa_error_adv: Resetting the current action list
Feb 01 21:40:17 [13256] ga1-ext crmd: warning: crmd_ha_msg_filter: Another DC detected: ga2-ext (op=join_offer)
Feb 01 21:40:17 [13256] ga1-ext crmd: info: do_state_transition: State transition S_INTEGRATION -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
Feb 01 21:40:17 [13256] ga1-ext crmd: info: update_dc: Unset DC. Was ga1-ext
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.9 -> 0.299.10 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:17 [13253] ga1-ext cib: info: cib_process_diff: Diff 0.299.10 -> 0.299.11 from ga2-ext not applied to 0.299.11: current "num_updates" is greater than required
Feb 01 21:40:18 [13247] ga1-ext pacemakerd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Feb 01 21:40:18 [13247] ga1-ext pacemakerd: error: mcp_cpg_destroy: Connection destroyed
Feb 01 21:40:18 [13247] ga1-ext pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2
Feb 01 21:40:18 [13255] ga1-ext attrd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Feb 01 21:40:18 [13255] ga1-ext attrd: crit: attrd_cs_destroy: Lost connection to Corosync service!
Feb 01 21:40:18 [13255] ga1-ext attrd: notice: main: Exiting...
Feb 01 21:40:18 [13255] ga1-ext attrd: notice: main: Disconnecting client 0x238ff10, pid=13256...
Feb 01 21:40:18 [13255] ga1-ext attrd: error: attrd_cib_connection_destroy: Connection to the CIB terminated...
Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: stonith_shutdown: Terminating with 1 clients
Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: cib_connection_destroy: Connection to the CIB closed.
Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: crm_client_destroy: Destroying 0 events
Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: qb_ipcs_us_withdraw: withdrawing server sockets
Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: main: Done
Feb 01 21:40:18 [13254] ga1-ext stonith-ng: info: crm_xml_cleanup: Cleaning up memory from libxml2
Feb 01 21:40:18 [13256] ga1-ext crmd: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Feb 01 21:40:18 [13256] ga1-ext crmd: error: crmd_cs_destroy: connection terminated
Feb 01 21:40:18 [13256] ga1-ext crmd: info: qb_ipcs_us_withdraw: withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Feb 01 21:40:18 [13253] ga1-ext cib: error: cib_cs_destroy: Corosync connection lost! Exiting.
Feb 01 21:40:18 [13253] ga1-ext cib: info: terminate_cib: cib_cs_destroy: Exiting fast...
Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_client_destroy: Destroying 0 events
Feb 01 21:40:18 [13253] ga1-ext cib: info: qb_ipcs_us_withdraw: withdrawing server sockets
Feb 01 21:40:18 [13253] ga1-ext cib: info: crm_xml_cleanup: Cleaning up memory from libxml2
Feb 01 21:40:18 [13256] ga1-ext crmd: info: tengine_stonith_connection_destroy: Fencing daemon disconnected
Feb 01 21:40:18 [13256] ga1-ext crmd: notice: crmd_exit: Forcing immediate exit: Link has been severed (67)
Feb 01 21:40:18 [13256] ga1-ext crmd: info: crm_xml_cleanup: Cleaning up memory from libxml2
Feb 01 21:40:18 [25258] ga1-ext lrmd: info: cancel_recurring_action: Cancelling operation ClusterIP_monitor_30000
Feb 01 21:40:18 [25258] ga1-ext lrmd: warning: qb_ipcs_event_sendv: new_event_notification (25258-13256-6): Bad file descriptor (9)
Feb 01 21:40:18 [25258] ga1-ext lrmd: warning: send_client_notify: Notification of client crmd/0b3ea733-7340-439c-9f46-81b0d7e1f6a1 failed
Feb 01 21:40:18 [25258] ga1-ext lrmd: info: crm_client_destroy: Destroying 1 events
Feb 01 21:40:18 [25260] ga1-ext pengine: info: crm_client_destroy: Destroying 0 events
--
Best regards
Alessandro Bono
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss