Redundant Infiniband Fabrics

Hello there.

Has anybody tried running corosync on a cluster with two rings on different InfiniBand fabrics?
We are seeing several issues here.
First, corosync usually aborts:

---8<---
# corosync -f
notice  [MAIN  ] Corosync Cluster Engine ('2.0.1'): started and ready to provide service.
info    [MAIN  ] Corosync built-in features: testagents rdma monitoring
Sep 20 11:28:50 notice  [TOTEM ] Initializing transport (Infiniband/IP).
Sep 20 11:28:50 notice  [TOTEM ] Initializing transport (Infiniband/IP).
corosync: totemsrp.c:3236: memb_ring_id_create_or_load: Assertion `!totemip_zero_check(&memb_ring_id->rep)' failed.
Ringbuffer:
 ->OVERWRITE
 ->write_pt [736]
 ->read_pt [0]
 ->size [2097152 words]
 =>free [8385660 bytes]
 =>used [2944 bytes]
Aborted
--->8---

Second, from time to time it simply segfaults:

---8<---
# corosync -f
notice  [MAIN  ] Corosync Cluster Engine ('2.0.1'): started and ready to provide service.
info    [MAIN  ] Corosync built-in features: testagents rdma monitoring
Sep 20 11:28:51 notice  [TOTEM ] Initializing transport (Infiniband/IP).
Sep 20 11:28:51 notice  [TOTEM ] Initializing transport (Infiniband/IP).
Sep 20 11:28:51 notice  [SERV  ] Service engine loaded: corosync configuration map access [0]
Sep 20 11:28:51 info    [QB    ] server name: cmap
Sep 20 11:28:51 notice  [SERV  ] Service engine loaded: corosync configuration service [1]
Sep 20 11:28:51 info    [QB    ] server name: cfg
Sep 20 11:28:51 notice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Sep 20 11:28:51 info    [QB    ] server name: cpg
Sep 20 11:28:51 notice  [SERV  ] Service engine loaded: corosync profile loading service [4]
Sep 20 11:28:51 notice  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Sep 20 11:28:51 notice  [QUORUM] Using quorum provider corosync_votequorum
Sep 20 11:28:51 notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Sep 20 11:28:51 info    [QB    ] server name: votequorum
Sep 20 11:28:51 notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Sep 20 11:28:51 info    [QB    ] server name: quorum
Ringbuffer:
 ->OVERWRITE
 ->write_pt [2776]
 ->read_pt [0]
 ->size [2097152 words]
 =>free [8377500 bytes]
 =>used [11104 bytes]
Segmentation fault
--->8---

And sometimes it does start.
But then, when the engines come up on all five nodes, two of them log kernel errors like:

---8<---
...
mlx4: local QP operation err (QPN 32004d, WQE index 0, vendor syndrome 6b, opcode = 5e)
mlx4: local QP operation err (QPN 3a004d, WQE index 0, vendor syndrome 6b, opcode = 5e)
...
--->8---

And the last node cannot join the others even after several seconds; it keeps forming single-member memberships:

---8<---
...
Sep 20 11:14:39 notice  [MAIN  ] Completed service synchronization, ready to provide service.
Sep 20 11:14:39 notice  [QUORUM] Members[1]: 83929280
Sep 20 11:14:39 notice  [TOTEM ] A processor joined or left the membership and a new membership (192.168.0.5:256) was formed.
Sep 20 11:14:39 notice  [MAIN  ] Completed service synchronization, ready to provide service.
Sep 20 11:14:42 notice  [QUORUM] Members[1]: 83929280
Sep 20 11:14:42 notice  [TOTEM ] A processor joined or left the membership and a new membership (192.168.0.5:264) was formed.
...
--->8---

Our corosync.conf is:

---8<---
totem {
        version: 2

        # How long before declaring a token lost (ms)
        token: 3000

        # How many token retransmits before forming a new configuration
        token_retransmits_before_loss_const: 10

        # How long to wait for join messages in the membership protocol (ms)
        join: 60

        # How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)
        consensus: 3600

        # Turn off the virtual synchrony filter
        vsftype: none

        # Number of messages that may be sent by one processor on receipt of the token
        max_messages: 20

        # Limit generated nodeids to 31 bits (positive signed integers)
        clear_node_high_bit: yes

        # Disable encryption
        secauth: off

        # How many threads to use for encryption/decryption
        threads: 0

        # Optionally assign a fixed node id (integer)
        # nodeid: 1234

        # This specifies the mode of redundant ring, which may be none, active, or passive.
        rrp_mode: passive

        interface {
                # The following values need to be set based on your environment
                ringnumber: 0
                bindnetaddr: 192.168.0.0
                mcastaddr: 192.168.0.255
                mcastport: 5405
        }

        interface {
                # The following values need to be set based on your environment
                ringnumber: 1
                bindnetaddr: 192.168.1.0
                mcastaddr: 192.168.1.255
                mcastport: 5405
        }

        netmtu: 2044
        transport: iba
}

amf {
        mode: disabled
}

service {
        # Load the Pacemaker Cluster Resource Manager
        ver:       0
        name:      pacemaker
}

aisexec {
        user:   root
        group:  root
}

logging {
        fileline: off
        to_stderr: yes
        to_logfile: no
        to_syslog: yes
        syslog_facility: daemon
        debug: off
        timestamp: on

        logger_subsys {
                subsys: AMF
                debug: off
                tags: enter|leave|trace1|trace2|trace3|trace4|trace6
        }
}

quorum {
        # Enable and configure quorum subsystem (default: off)
        # see also corosync.conf.5 and votequorum.5
        provider: corosync_votequorum
        expected_votes: 3
}
--->8---
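
One variable we have not yet eliminated is the RDMA transport itself: the same two rings could be run as plain UDP multicast over the IPoIB interfaces for comparison. A minimal sketch of that change (only the differing totem key; everything else as above, untested):

---8<---
totem {
        # ...same settings and interface {} blocks as above, except:
        transport: udp    # plain UDP/IP multicast over IPoIB instead of iba
}
--->8---

If the asserts and segfaults disappear with transport: udp, that would point at the iba (RDMA) code path rather than at our ring configuration.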

We run self-compiled corosync 2.0.1 on Debian squeeze hosts.

Thank you.
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss

