Help on debugging corosync communication

Short description 
--------------------- 
The HA nodes don't seem to communicate with each other via corosync.


Final goal 
------------ 
Being able to run Zimbra in HA as a 2-node active/passive cluster.


Description of the system 
----------------------------------- 
This is Ubuntu 10.04 LTS, because the current stable Zimbra works on Ubuntu 10.04 and not yet on 12.04.


I've dist-upgraded packages from https://launchpad.net/~ubuntu-ha-maintainers/+archive/ppa, as advised on several sites.
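
For reference, this is roughly how those packages were pulled in (just a sketch; I'm assuming add-apt-repository from python-software-properties is available on 10.04):

    add-apt-repository ppa:ubuntu-ha-maintainers/ppa
    apt-get update
    apt-get dist-upgrade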


My main configuration is based on this document: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 


I've written some OCF resource agents of my own (for Zimbra and some network tasks) and have already tested them with ocf-tester and ocf-tester-py (a hack of mine on top of ocf-tester that allows testing Python-based OCF scripts).
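
In case it's useful, this is the kind of invocation I mean (the agent path just follows the standard OCF layout for the ocf::btactic provider; treat it as an illustration, not necessarily my exact command):

    ocf-tester -n ZimbraServer /usr/lib/ocf/resource.d/btactic/zimbra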


Finally, some package versions:

libcrmcluster1 1.1.6-2ubuntu0~ppa2 
libcrmcommon2 1.1.6-2ubuntu0~ppa2 
corosync 1.4.2-1ubuntu0~ppa1 
libcorosync4 1.4.2-1ubuntu0~ppa1 
lvm2 2.02.54-1ubuntu4.1ppa5 
pacemaker 1.1.6-2ubuntu0~ppa2 
libglib2.0-0 2.24.1-0ubuntu1.1~ppa1 
cluster-glue 1.0.8-2ubuntu0~ppa4 
libcluster-glue 1.0.8-2ubuntu0~ppa4 
resource-agents 1:3.9.2-4ubuntu0~ppa2 


Node 1 - corosync-objctl runtime.totem.pg.mrp.srp.members 
------------------------------------------------------------------------------- 

runtime.totem.pg.mrp.srp.171616448.ip=r(0) ip(192.168.58.10) 
runtime.totem.pg.mrp.srp.171616448.join_count=1 
runtime.totem.pg.mrp.srp.171616448.status=joined 


Node 2 - corosync-objctl runtime.totem.pg.mrp.srp.members 
------------------------------------------------------------------------------- 

runtime.totem.pg.mrp.srp.171616448.ip=r(0) ip(192.168.58.10) 
runtime.totem.pg.mrp.srp.171616448.join_count=1 
runtime.totem.pg.mrp.srp.171616448.status=joined 
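
For what it's worth, as far as I understand a healthy two-node ring should show one runtime.totem.pg.mrp.srp.<nodeid> entry per member here (so two entries on each node). Another related check with the corosync 1.4 tooling is:

    corosync-cfgtool -s    # prints the local node id plus the id and status of each configured ring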


Node 1 - tcpdump -envv "port 5405" -i eth1 
------------------------------------------------------------ 

10:51:56.748054 0a:00:27:00:00:02 > 08:00:27:26:45:5b, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
2.4.6.8.37357 > 192.168.58.10.5405: UDP, length 82 
10:51:56.914846 08:00:27:26:45:5b > 0a:00:27:00:00:02, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
192.168.58.10.34890 > 2.4.6.8.5405: UDP, length 82 
10:51:57.087184 0a:00:27:00:00:02 > 08:00:27:26:45:5b, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
2.4.6.8.37357 > 192.168.58.10.5405: UDP, length 82 
10:51:57.137976 08:00:27:26:45:5b > 0a:00:27:00:00:02, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
192.168.58.10.34890 > 2.4.6.8.5405: UDP, length 82 
10:51:57.339116 0a:00:27:00:00:02 > 08:00:27:26:45:5b, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
2.4.6.8.37357 > 192.168.58.10.5405: UDP, length 82 
10:51:57.505602 08:00:27:26:45:5b > 0a:00:27:00:00:02, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
192.168.58.10.34890 > 2.4.6.8.5405: UDP, length 82 
10:51:57.709958 0a:00:27:00:00:02 > 08:00:27:26:45:5b, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
2.4.6.8.37357 > 192.168.58.10.5405: UDP, length 82 
10:51:57.728345 08:00:27:26:45:5b > 0a:00:27:00:00:02, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
192.168.58.10.34890 > 2.4.6.8.5405: UDP, length 82 
10:51:57.962354 0a:00:27:00:00:02 > 08:00:27:26:45:5b, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
2.4.6.8.37357 > 192.168.58.10.5405: UDP, length 82 
10:51:58.094887 08:00:27:26:45:5b > 0a:00:27:00:00:02, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
192.168.58.10.34890 > 2.4.6.8.5405: UDP, length 82 
10:51:58.301512 0a:00:27:00:00:02 > 08:00:27:26:45:5b, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
2.4.6.8.37357 > 192.168.58.10.5405: UDP, length 82 
10:51:58.301657 08:00:27:26:45:5b > 0a:00:27:00:00:02, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
192.168.58.10.34890 > 2.4.6.8.5405: UDP, length 82 
10:51:58.576392 0a:00:27:00:00:02 > 08:00:27:26:45:5b, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
2.4.6.8.37357 > 192.168.58.10.5405: UDP, length 82 
10:51:58.684891 08:00:27:26:45:5b > 0a:00:27:00:00:02, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
192.168.58.10.34890 > 2.4.6.8.5405: UDP, length 82 
10:51:58.894880 08:00:27:26:45:5b > 0a:00:27:00:00:02, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
192.168.58.10.34890 > 2.4.6.8.5405: UDP, length 82 
10:51:58.914985 0a:00:27:00:00:02 > 08:00:27:26:45:5b, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
2.4.6.8.37357 > 192.168.58.10.5405: UDP, length 82 
10:51:59.168176 0a:00:27:00:00:02 > 08:00:27:26:45:5b, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
2.4.6.8.37357 > 192.168.58.10.5405: UDP, length 82 
10:51:59.275154 08:00:27:26:45:5b > 0a:00:27:00:00:02, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
192.168.58.10.34890 > 2.4.6.8.5405: UDP, length 82 
10:51:59.485189 08:00:27:26:45:5b > 0a:00:27:00:00:02, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
192.168.58.10.34890 > 2.4.6.8.5405: UDP, length 82 
10:51:59.514365 0a:00:27:00:00:02 > 08:00:27:26:45:5b, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
2.4.6.8.37357 > 192.168.58.10.5405: UDP, length 82 
10:51:59.775556 0a:00:27:00:00:02 > 08:00:27:26:45:5b, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
2.4.6.8.37357 > 192.168.58.10.5405: UDP, length 82 
10:51:59.864912 08:00:27:26:45:5b > 0a:00:27:00:00:02, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
192.168.58.10.34890 > 2.4.6.8.5405: UDP, length 82 
10:52:00.074440 08:00:27:26:45:5b > 0a:00:27:00:00:02, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
192.168.58.10.34890 > 2.4.6.8.5405: UDP, length 82 
10:52:00.105667 0a:00:27:00:00:02 > 08:00:27:26:45:5b, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto UDP (17), length 110) 



Node 2 - tcpdump -envv "port 5405" -i eth1 
------------------------------------------------------------ 

10:55:12.229883 0a:00:27:00:00:02 > 08:00:27:05:af:40, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
1.2.3.4.34890 > 192.168.58.10.5405: UDP, length 82 
10:55:12.247341 08:00:27:05:af:40 > 0a:00:27:00:00:02, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
192.168.58.10.37357 > 1.2.3.4.5405: UDP, length 82 
10:55:12.457267 0a:00:27:00:00:02 > 08:00:27:05:af:40, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
1.2.3.4.34890 > 192.168.58.10.5405: UDP, length 82 
10:55:12.578838 08:00:27:05:af:40 > 0a:00:27:00:00:02, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
192.168.58.10.37357 > 1.2.3.4.5405: UDP, length 82 
10:55:12.819008 0a:00:27:00:00:02 > 08:00:27:05:af:40, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
1.2.3.4.34890 > 192.168.58.10.5405: UDP, length 82 
10:55:12.834251 08:00:27:05:af:40 > 0a:00:27:00:00:02, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
192.168.58.10.37357 > 1.2.3.4.5405: UDP, length 82 
10:55:13.043014 0a:00:27:00:00:02 > 08:00:27:05:af:40, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
1.2.3.4.34890 > 192.168.58.10.5405: UDP, length 82 
10:55:13.168621 08:00:27:05:af:40 > 0a:00:27:00:00:02, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
192.168.58.10.37357 > 1.2.3.4.5405: UDP, length 82 
10:55:13.410025 0a:00:27:00:00:02 > 08:00:27:05:af:40, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
1.2.3.4.34890 > 192.168.58.10.5405: UDP, length 82 
10:55:13.423383 08:00:27:05:af:40 > 0a:00:27:00:00:02, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
192.168.58.10.37357 > 1.2.3.4.5405: UDP, length 82 
10:55:13.633936 0a:00:27:00:00:02 > 08:00:27:05:af:40, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
1.2.3.4.34890 > 192.168.58.10.5405: UDP, length 82 
10:55:13.758722 08:00:27:05:af:40 > 0a:00:27:00:00:02, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
192.168.58.10.37357 > 1.2.3.4.5405: UDP, length 82 
10:55:14.000246 0a:00:27:00:00:02 > 08:00:27:05:af:40, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
1.2.3.4.34890 > 192.168.58.10.5405: UDP, length 82 
10:55:14.013566 08:00:27:05:af:40 > 0a:00:27:00:00:02, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
192.168.58.10.37357 > 1.2.3.4.5405: UDP, length 82 
10:55:14.223766 0a:00:27:00:00:02 > 08:00:27:05:af:40, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
1.2.3.4.34890 > 192.168.58.10.5405: UDP, length 82 
10:55:14.350019 08:00:27:05:af:40 > 0a:00:27:00:00:02, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 110) 
192.168.58.10.37357 > 1.2.3.4.5405: UDP, length 82 
10:55:14.603364 08:00:27:05:af:40 > 0a:00:27:00:00:02, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 110) 


Node 1 - corosync.conf 
------------------------------- 

totem {
    version: 2

    token: 5000
    token_retransmits_before_loss_const: 20
    join: 1000
    consensus: 7500
    vsftype: none
    max_messages: 20
    clear_node_high_bit: yes
    secauth: off
    threads: 0
    rrp_mode: passive
    interface {
        member {
            memberaddr: 192.168.58.10
        }
        member {
            memberaddr: 2.4.6.8
        }
        ringnumber: 0
        bindnetaddr: 192.168.58.10
        mcastport: 5405
    }
    transport: udpu
}

amf {
    mode: disabled
}

service {
    ver: 0
    name: pacemaker
}

aisexec {
    user: root
    group: root
}

logging {
    fileline: off
    to_logfile: yes
    to_syslog: yes
    debug: on
    logfile: /var/log/cluster/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
        tags: enter|leave|trace1|trace2|trace3|trace4|trace6
    }
}


Node 2 - corosync.conf 
------------------------------- 

totem {
    version: 2

    token: 5000
    token_retransmits_before_loss_const: 20
    join: 1000
    consensus: 7500
    vsftype: none
    max_messages: 20
    clear_node_high_bit: yes
    secauth: off
    threads: 0
    rrp_mode: passive
    interface {
        member {
            memberaddr: 192.168.58.10
        }
        member {
            memberaddr: 1.2.3.4
        }
        ringnumber: 0
        bindnetaddr: 192.168.58.10
        mcastport: 5405
    }
    transport: udpu
}

amf {
    mode: disabled
}

service {
    ver: 0
    name: pacemaker
}

aisexec {
    user: root
    group: root
}

logging {
    fileline: off
    to_logfile: yes
    to_syslog: yes
    debug: on
    logfile: /var/log/cluster/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
        tags: enter|leave|trace1|trace2|trace3|trace4|trace6
    }
}
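
As a sanity check on the bindnetaddr above, one can also confirm which address and port corosync actually bound to (a rough sketch; assuming net-tools is installed):

    netstat -anup | grep corosync    # list UDP sockets, numeric, with owning process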


Node 1 - crm_mon -orVVVV1 
-------------------------------------- 

crm_mon[17706]: 2012/08/20_11:07:35 info: main: Starting crm_mon 
crm_mon[17706]: 2012/08/20_11:07:35 info: unpack_config: Startup probes: enabled 
crm_mon[17706]: 2012/08/20_11:07:35 notice: unpack_config: On loss of CCM Quorum: Ignore 
crm_mon[17706]: 2012/08/20_11:07:35 info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0 
crm_mon[17706]: 2012/08/20_11:07:35 info: unpack_domains: Unpacking domains 
crm_mon[17706]: 2012/08/20_11:07:35 info: determine_online_status: Node zhatest-01.domain.com is online 
crm_mon[17706]: 2012/08/20_11:07:35 notice: unpack_rsc_op: Hard error - ZimbraServer_last_failure_0 failed with rc=5: Preventing ZimbraServer from re-starting on zhatest-01.domain.com 
crm_mon[17706]: 2012/08/20_11:07:35 notice: unpack_rsc_op: Hard error - ZimbraFS_last_failure_0 failed with rc=5: Preventing ZimbraFS from re-starting on zhatest-01.domain.com 
crm_mon[17706]: 2012/08/20_11:07:35 WARN: unpack_rsc_op: Processing failed op ZimbraFS_last_failure_0 on zhatest-01.domain.com: not installed (5) 
crm_mon[17706]: 2012/08/20_11:07:35 notice: unpack_rsc_op: Operation ClusterOVHFailover_last_failure_0 found resource ClusterOVHFailover active on zhatest-01.domain.com 
crm_mon[17706]: 2012/08/20_11:07:35 WARN: unpack_rsc_op: Processing failed op ClusterOVHFailover_monitor_120000 on zhatest-01.domain.com: unknown exec error (-2) 
============ 
Last updated: Mon Aug 20 11:07:35 2012 
Last change: Sun Aug 19 23:06:39 2012 via crmd on zhatest-01.domain.com 
Stack: openais 
Current DC: zhatest-01.domain.com - partition WITHOUT quorum 
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c 
2 Nodes configured, 2 expected votes 
9 Resources configured. 
============ 


Online: [ zhatest-01.domain.com ] 
OFFLINE: [ zhatest-02.domain.com ] 


Full list of resources: 


Resource Group: MySystem 
ClusterOVHFailover (ocf::btactic:OVHfailover): Started zhatest-01.domain.com FAILED 
ClusterIP (ocf::heartbeat:IPaddr2): Stopped 
ClusterHostRoute (ocf::btactic:OVHhostroute): Stopped 
DisableAlternativeRoute (ocf::btactic:OppositeRoute): Stopped 
ClusterDefaultRoute (ocf::btactic:OVHdefaultroute): Stopped 
Resource Group: MyZimbra 
ZimbraFS (ocf::heartbeat:Filesystem): Stopped 
ZimbraServer (ocf::btactic:zimbra): Stopped 
Master/Slave Set: ZimbraDataClone [ZimbraData] 
Masters: [ zhatest-01.domain.com ] 
Stopped: [ ZimbraData:1 ] 


Operations: 
* Node zhatest-01.domain.com: 
DisableAlternativeRoute: migration-threshold=1000000 
+ (58) monitor: interval=60000ms rc=0 (ok) 
+ (65) stop: rc=0 (ok) 
ClusterHostRoute: migration-threshold=1000000 
+ (56) monitor: interval=30000ms rc=0 (ok) 
+ (66) stop: rc=0 (ok) 
ClusterIP: migration-threshold=1000000 
+ (54) monitor: interval=30000ms rc=0 (ok) 
+ (67) stop: rc=0 (ok) 
ZimbraServer: migration-threshold=1000000 
+ (8) probe: rc=5 (not installed) 
ClusterDefaultRoute: migration-threshold=1000000 
+ (60) monitor: interval=30000ms rc=0 (ok) 
+ (63) stop: rc=0 (ok) 
crm_mon[17706]: 2012/08/20_11:07:35 info: get_failcount: ZimbraFS has failed INFINITY times on zhatest-01.domain.com 
ZimbraFS: migration-threshold=1000000 fail-count=1000000 
+ (24) start: rc=5 (not installed) 
+ (26) stop: rc=0 (ok) 
ZimbraData:0: migration-threshold=1000000 
+ (25) monitor: interval=60000ms rc=8 (master) 
+ (61) promote: rc=0 (ok) 
crm_mon[17706]: 2012/08/20_11:07:35 info: get_failcount: ClusterOVHFailover has failed 3 times on zhatest-01.domain.com 
ClusterOVHFailover: migration-threshold=1000000 fail-count=3 
+ (2) probe: rc=0 (ok) 
+ (51) start: rc=0 (ok) 
+ (52) monitor: interval=120000ms rc=-2 (unknown exec error) 


Failed actions: 
ZimbraServer_monitor_0 (node=zhatest-01.domain.com, call=8, rc=5, status=complete): not installed 
ZimbraFS_start_0 (node=zhatest-01.domain.com, call=24, rc=5, status=complete): not installed 
ClusterOVHFailover_monitor_120000 (node=zhatest-01.domain.com, call=52, rc=-2, status=Timed Out): unknown exec error 

Node 2 - crm_mon -orVVVV1 
-------------------------------------- 

crm_mon[14699]: 2012/08/20_11:13:14 info: main: Starting crm_mon 
crm_mon[14699]: 2012/08/20_11:13:14 info: unpack_config: Startup probes: enabled 
crm_mon[14699]: 2012/08/20_11:13:14 notice: unpack_config: On loss of CCM Quorum: Ignore 
crm_mon[14699]: 2012/08/20_11:13:14 info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0 
crm_mon[14699]: 2012/08/20_11:13:14 info: unpack_domains: Unpacking domains 
crm_mon[14699]: 2012/08/20_11:13:14 info: determine_online_status: Node zhatest-02.domain.com is online 
crm_mon[14699]: 2012/08/20_11:13:14 notice: unpack_rsc_op: Hard error - ZimbraServer_last_failure_0 failed with rc=5: Preventing ZimbraServer from re-starting on zhatest-02.domain.com 
crm_mon[14699]: 2012/08/20_11:13:14 notice: unpack_rsc_op: Operation ZimbraData:0_last_failure_0 found resource ZimbraData:0 active on zhatest-02.domain.com 
crm_mon[14699]: 2012/08/20_11:13:14 WARN: unpack_rsc_op: Processing failed op ClusterOVHFailover_monitor_120000 on zhatest-02.domain.com: unknown exec error (-2) 
crm_mon[14699]: 2012/08/20_11:13:14 notice: unpack_rsc_op: Hard error - ClusterOVHFailover_last_failure_0 failed with rc=5: Preventing ClusterOVHFailover from re-starting on zhatest-02.domain.com 
crm_mon[14699]: 2012/08/20_11:13:14 WARN: unpack_rsc_op: Processing failed op ClusterOVHFailover_last_failure_0 on zhatest-02.domain.com: not installed (5) 
crm_mon[14699]: 2012/08/20_11:13:14 info: native_add_running: resource ClusterOVHFailover isnt managed 
============ 
Last updated: Mon Aug 20 11:13:14 2012 
Last change: Sun Aug 19 23:07:10 2012 via crmd on zhatest-02.domain.com 
Stack: openais 
Current DC: zhatest-02.domain.com - partition WITHOUT quorum 
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c 
2 Nodes configured, 2 expected votes 
9 Resources configured. 
============ 


Online: [ zhatest-02.domain.com ] 
OFFLINE: [ zhatest-01.domain.com ] 


Full list of resources: 


Resource Group: MySystem 
ClusterOVHFailover (ocf::btactic:OVHfailover): Started zhatest-02.domain.com (unmanaged) FAILED 
ClusterIP (ocf::heartbeat:IPaddr2): Stopped 
ClusterHostRoute (ocf::btactic:OVHhostroute): Stopped 
DisableAlternativeRoute (ocf::btactic:OppositeRoute): Stopped 
ClusterDefaultRoute (ocf::btactic:OVHdefaultroute): Stopped 
Resource Group: MyZimbra 
ZimbraFS (ocf::heartbeat:Filesystem): Stopped 
ZimbraServer (ocf::btactic:zimbra): Stopped 
Master/Slave Set: ZimbraDataClone [ZimbraData] 
Slaves: [ zhatest-02.domain.com ] 
Stopped: [ ZimbraData:1 ] 


Operations: 
* Node zhatest-02.domain.com: 
DisableAlternativeRoute: migration-threshold=1000000 
+ (18) monitor: interval=60000ms rc=0 (ok) 
+ (25) stop: rc=0 (ok) 
ClusterHostRoute: migration-threshold=1000000 
+ (16) monitor: interval=30000ms rc=0 (ok) 
+ (26) stop: rc=0 (ok) 
ClusterIP: migration-threshold=1000000 
+ (14) monitor: interval=30000ms rc=0 (ok) 
+ (27) stop: rc=0 (ok) 
ZimbraServer: migration-threshold=1000000 
+ (8) probe: rc=5 (not installed) 
ClusterDefaultRoute: migration-threshold=1000000 
+ (20) monitor: interval=30000ms rc=0 (ok) 
+ (23) stop: rc=0 (ok) 
ZimbraData:0: migration-threshold=1000000 
+ (9) probe: rc=0 (ok) 
+ (30) demote: rc=0 (ok) 
+ (32) monitor: interval=50000ms rc=0 (ok) 
crm_mon[14699]: 2012/08/20_11:13:14 info: get_failcount: ClusterOVHFailover has failed INFINITY times on zhatest-02.domain.com 
ClusterOVHFailover: migration-threshold=1000000 fail-count=1000000 
+ (10) start: rc=0 (ok) 
+ (12) monitor: interval=120000ms rc=-2 (unknown exec error) 
+ (28) stop: rc=5 (not installed) 


Failed actions: 
ZimbraServer_monitor_0 (node=zhatest-02.domain.com, call=8, rc=5, status=complete): not installed 
ClusterOVHFailover_monitor_120000 (node=zhatest-02.domain.com, call=12, rc=-2, status=Timed Out): unknown exec error 
ClusterOVHFailover_stop_0 (node=zhatest-02.domain.com, call=28, rc=5, status=complete): not installed 

Specific details for this setup: 
-------------------------------------- 
* Although both nodes have the same 192.168.58.10, as you can see in the logs, they live in two different networks... I mean, as far as I know there's no problem with 192.168.58.10 being repeated.


* 1.2.3.4 is the public ip for node 1 
* 2.4.6.8 is the public ip for node 2 
* 5405 udp is redirected from the public ip 1.2.3.4 to node 1's internal ip 192.168.58.10 
* 5405 udp is redirected from the public ip 2.4.6.8 to node 2's internal ip 192.168.58.10 (an illustrative rule is sketched after this list) 
* Node 1 name is: zhatest-01.domain.com 
* Node 2 name is: zhatest-02.domain.com 
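
To illustrate what I mean by the redirection (the real forwarding happens outside the nodes, so this is only the kind of DNAT rule being described, not my actual setup):

    iptables -t nat -A PREROUTING -p udp -d 1.2.3.4 --dport 5405 -j DNAT --to-destination 192.168.58.10:5405    # in front of node 1
    iptables -t nat -A PREROUTING -p udp -d 2.4.6.8 --dport 5405 -j DNAT --to-destination 192.168.58.10:5405    # in front of node 2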

Long description 
--------------------- 
The HA nodes don't seem to communicate with each other via corosync. That's what I infer from the crm_mon output, although I'm not an expert (each node shows itself as online and the other one as offline, and on the other host it's the other way around). I also infer that there is some kind of communication, because tcpdump sees packets in both directions (although this is the first time I look at tcpdump output, so I might be wrong about that too).


Is there any tool/command that actually checks whether corosync is communicating with the other node? 
Are there other commands I can use to debug this, so that I can find where the problem is? 
Or am I perhaps wrong, and corosync is communicating fine and the problem is elsewhere? 
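
In case it helps, since debug logging goes to /var/log/cluster/corosync.log on both nodes, a grep over that log might be one way to spot membership changes (the keyword list is just a guess at what's relevant):

    grep -iE "totem|memb|join" /var/log/cluster/corosync.log | tail -n 50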


If you need more logs or details about the setup, do not hesitate to ask for them. 


Thank you. 


-- 
Adrián Gibanel 
I.T. Manager 

+34 675 683 301 
www.btactic.com 





