Hello there.
Has anybody tried to run corosync on a cluster with two redundant rings
on different InfiniBand fabrics?
We have several issues here.
First, corosync usually aborts right at startup:
---8<---
# corosync -f
notice [MAIN ] Corosync Cluster Engine ('2.0.1'): started and ready to provide service.
info [MAIN ] Corosync built-in features: testagents rdma monitoring
Sep 20 11:28:50 notice [TOTEM ] Initializing transport (Infiniband/IP).
Sep 20 11:28:50 notice [TOTEM ] Initializing transport (Infiniband/IP).
corosync: totemsrp.c:3236: memb_ring_id_create_or_load: Assertion `!totemip_zero_check(&memb_ring_id->rep)' failed.
Ringbuffer:
->OVERWRITE
->write_pt [736]
->read_pt [0]
->size [2097152 words]
=>free [8385660 bytes]
=>used [2944 bytes]
Aborted
--->8---
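If I read the assertion right, the ring id was created with an all-zero
representative address, i.e. the iba transport handed totemsrp 0.0.0.0
as our own ring 0 address. My reading of the check, as a paraphrased
sketch (simplified field layout, not the actual corosync source):
---8<---
/* Paraphrased sketch of the check that aborts for us (my reading of
 * memb_ring_id_create_or_load in totemsrp.c, not the real source). */
#include <assert.h>
#include <string.h>

struct totem_ip_address {
    unsigned short family;   /* AF_INET / AF_INET6 */
    unsigned char addr[16];  /* binary address, zero-padded */
};

struct memb_ring_id {
    struct totem_ip_address rep;  /* the ring's representative */
    unsigned long long seq;
};

/* 1 if the address is all zeros, 0 otherwise */
static int totemip_zero_check(const struct totem_ip_address *ip)
{
    static const unsigned char zero[16];
    return memcmp(ip->addr, zero, sizeof(zero)) == 0;
}

int main(void)
{
    struct memb_ring_id ring_id;

    /* what the abort suggests happens to us: the transport never
     * filled in our own address, so rep is still 0.0.0.0 */
    memset(&ring_id, 0, sizeof(ring_id));

    assert(!totemip_zero_check(&ring_id.rep));  /* aborts, like corosync */
    return 0;
}
--->8---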
Second, from time to time it just segfaults:
---8<---
# corosync -f
notice [MAIN ] Corosync Cluster Engine ('2.0.1'): started and ready to provide service.
info [MAIN ] Corosync built-in features: testagents rdma monitoring
Sep 20 11:28:51 notice [TOTEM ] Initializing transport (Infiniband/IP).
Sep 20 11:28:51 notice [TOTEM ] Initializing transport (Infiniband/IP).
Sep 20 11:28:51 notice [SERV ] Service engine loaded: corosync configuration map access [0]
Sep 20 11:28:51 info [QB ] server name: cmap
Sep 20 11:28:51 notice [SERV ] Service engine loaded: corosync configuration service [1]
Sep 20 11:28:51 info [QB ] server name: cfg
Sep 20 11:28:51 notice [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Sep 20 11:28:51 info [QB ] server name: cpg
Sep 20 11:28:51 notice [SERV ] Service engine loaded: corosync profile loading service [4]
Sep 20 11:28:51 notice [SERV ] Service engine loaded: corosync resource monitoring service [6]
Sep 20 11:28:51 notice [QUORUM] Using quorum provider corosync_votequorum
Sep 20 11:28:51 notice [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Sep 20 11:28:51 info [QB ] server name: votequorum
Sep 20 11:28:51 notice [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Sep 20 11:28:51 info [QB ] server name: quorum
Ringbuffer:
->OVERWRITE
->write_pt [2776]
->read_pt [0]
->size [2097152 words]
=>free [8377500 bytes]
=>used [11104 bytes]
Segmentation fault
--->8---
And sometimes it starts without problems.
Third, when the engine does start on all 5 nodes, two of them show errors like:
---8<---
...
mlx4: local QP operation err (QPN 32004d, WQE index 0, vendor syndrome 6b, opcode = 5e)
mlx4: local QP operation err (QPN 3a004d, WQE index 0, vendor syndrome 6b, opcode = 5e)
...
--->8---
And the last node cannot join the others even after several seconds; it
just keeps forming a single-member membership:
---8<---
...
Sep 20 11:14:39 notice [MAIN ] Completed service synchronization, ready to provide service.
Sep 20 11:14:39 notice [QUORUM] Members[1]: 83929280
Sep 20 11:14:39 notice [TOTEM ] A processor joined or left the membership and a new membership (192.168.0.5:256) was formed.
Sep 20 11:14:39 notice [MAIN ] Completed service synchronization, ready to provide service.
Sep 20 11:14:42 notice [QUORUM] Members[1]: 83929280
Sep 20 11:14:42 notice [TOTEM ] A processor joined or left the membership and a new membership (192.168.0.5:264) was formed.
...
--->8---
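(As an aside for anyone reading these logs: the nodeid 83929280 is just
192.168.0.5. With no nodeid set, corosync derives it from the ring 0
address, which on our little-endian hosts comes out as the raw in_addr
bytes read as a host-order integer; clear_node_high_bit then leaves it
untouched since bit 31 is already zero. A standalone arithmetic demo,
not corosync code:)
---8<---
/* Standalone demo (not corosync code): nodeid 83929280 from the logs
 * is the ring 0 address 192.168.0.5 read as a host-order integer on a
 * little-endian host, after the clear_node_high_bit mask. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

int main(void)
{
    struct in_addr ia;
    uint32_t nodeid;

    inet_pton(AF_INET, "192.168.0.5", &ia);

    /* raw in_addr bytes as a host-order integer (little-endian here) */
    memcpy(&nodeid, &ia, sizeof(nodeid));

    /* clear_node_high_bit: yes -- a no-op for this address */
    nodeid &= 0x7fffffff;

    printf("%u\n", nodeid);  /* prints 83929280 */
    return 0;
}
--->8---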
The corosync.conf is:
---8<---
totem {
    version: 2
    # How long before declaring a token lost (ms)
    token: 3000
    # How many token retransmits before forming a new configuration
    token_retransmits_before_loss_const: 10
    # How long to wait for join messages in the membership protocol (ms)
    join: 60
    # How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)
    consensus: 3600
    # Turn off the virtual synchrony filter
    vsftype: none
    # Number of messages that may be sent by one processor on receipt of the token
    max_messages: 20
    # Limit generated nodeids to 31-bits (positive signed integers)
    clear_node_high_bit: yes
    # Disable encryption
    secauth: off
    # How many threads to use for encryption/decryption
    threads: 0
    # Optionally assign a fixed node id (integer)
    # nodeid: 1234
    # This specifies the mode of redundant ring, which may be none, active, or passive.
    rrp_mode: passive
    interface {
        # The following values need to be set based on your environment
        ringnumber: 0
        bindnetaddr: 192.168.0.0
        mcastaddr: 192.168.0.255
        mcastport: 5405
    }
    interface {
        # The following values need to be set based on your environment
        ringnumber: 1
        bindnetaddr: 192.168.1.0
        mcastaddr: 192.168.1.255
        mcastport: 5405
    }
    netmtu: 2044
    transport: iba
}
amf {
    mode: disabled
}
service {
    # Load the Pacemaker Cluster Resource Manager
    ver: 0
    name: pacemaker
}
aisexec {
    user: root
    group: root
}
logging {
    fileline: off
    to_stderr: yes
    to_logfile: no
    to_syslog: yes
    syslog_facility: daemon
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
        tags: enter|leave|trace1|trace2|trace3|trace4|trace6
    }
}
quorum {
    # Enable and configure quorum subsystem (default: off)
    # see also corosync.conf.5 and votequorum.5
    provider: corosync_votequorum
    expected_votes: 3
}
--->8---
We run a self-compiled corosync on Debian squeeze hosts.
Thank you.