Re: cluster won't form - token lost in commit state

Patrick,


I'm trying to get corosync running inside 2 docker containers. One of
them is spewing out lots of "The token was lost in the COMMIT state."
messages. The other is simply logging "The consensus timeout expired."
(which given the state of the other node, is expected).

Googling the commit state message turns up almost nothing, so I have no
clue what it means.

Both nodes are inside docker containers which each get NATed before
leaving the server (using UDPU). I've taken this into consideration and
have manually set the nodeid for each so that it's not based off the IP
address.

NAT is the problem. Basically, the config file has to be in sync across all nodes, which is not the case here.

But you can use iptables DNAT magic to make it work. Please take the time to read this thread:

http://lists.corosync.org/pipermail/discuss/2012-August/001865.html

It has a more in-depth explanation and a solution.
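
To illustrate what "in sync" means here, the following is a minimal Python sketch, not anything corosync itself runs. It takes the nodelist and the /etc/hosts entries posted in this thread and shows which address each node ends up with for every ring0_addr. The names HOSTS, NODELIST, resolved_view and mismatches are made up for this example.

```python
# Hostname-to-address mappings copied from each node's /etc/hosts in this thread.
HOSTS = {
    "i-cd3b0393": {"i-cd3b0393": "172.17.0.21", "i-a2542ffc": "10.20.50.204"},
    "i-a2542ffc": {"i-a2542ffc": "172.17.0.7", "i-cd3b0393": "10.20.27.52"},
}

# ring0_addr entries from the nodelist, identical in both corosync.conf files.
NODELIST = {1862911301: "i-a2542ffc", 2585129852: "i-cd3b0393"}


def resolved_view(local_node):
    """Return {nodeid: address} as local_node would resolve the nodelist."""
    hosts = HOSTS[local_node]
    return {nodeid: hosts[name] for nodeid, name in NODELIST.items()}


def mismatches(node_a, node_b):
    """Map each nodeid to its (address on node_a, address on node_b) when they differ."""
    view_a, view_b = resolved_view(node_a), resolved_view(node_b)
    return {
        nodeid: (view_a[nodeid], view_b[nodeid])
        for nodeid in NODELIST
        if view_a[nodeid] != view_b[nodeid]
    }


if __name__ == "__main__":
    found = mismatches("i-cd3b0393", "i-a2542ffc")
    for nodeid, (addr_a, addr_b) in sorted(found.items()):
        print(f"nodeid {nodeid}: {addr_a} on i-cd3b0393 vs {addr_b} on i-a2542ffc")
```

Both entries come back as mismatched: the text of the two config files is identical, but because of the NAT each node resolves the member names to a different set of addresses, so the effective configuration is not the same on both sides.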

Regards,
  Honza


tcpdump shows me that both nodes are receiving traffic from the other
node. However the node which is throwing the 'lost in commit state' is
only sending a packet every few seconds, whereas the 'consensus
timeout' node is sending a ton of packets.
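
A rough way to quantify that imbalance is to count packets per source address in the capture. The sketch below is only an illustration in Python: it uses the five lines captured on node 1 (with the wrapped lines rejoined), and the names CAPTURE, LINE and packets_per_source are invented for this example.

```python
import re
from collections import Counter

# Excerpt captured on node 1 (i-cd3b0393), copied from this thread.
CAPTURE = """\
03:03:58.846382 IP 172.17.0.21.57910 > 10.20.50.204.5405: UDP, length 163
03:03:58.896435 IP 172.17.0.21.57910 > 10.20.50.204.5405: UDP, length 163
03:03:58.945786 IP 10.20.50.204.37971 > 172.17.0.21.5405: UDP, length 163
03:03:58.946487 IP 172.17.0.21.57910 > 10.20.50.204.5405: UDP, length 163
03:03:58.996544 IP 172.17.0.21.57910 > 10.20.50.204.5405: UDP, length 163
"""

# Matches "<time> IP <src ip>.<src port> > <dst ip>.<dst port>: UDP".
LINE = re.compile(
    r"^\S+ IP (\d+\.\d+\.\d+\.\d+)\.\d+ > (\d+\.\d+\.\d+\.\d+)\.\d+: UDP"
)


def packets_per_source(capture):
    """Count UDP packets per source IP in tcpdump text output."""
    counts = Counter()
    for line in capture.splitlines():
        match = LINE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for source, count in packets_per_source(CAPTURE).most_common():
        print(f"{source}: {count} packets")
```

In this short excerpt four of the five packets come from 172.17.0.21 (the 'consensus timeout' node) and one from 10.20.50.204, which matches what I see by eye. A longer capture would be needed to say anything about actual rates.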


Node 1:
------------
Name: i-cd3b0393
Container IP: 172.17.0.21 (the IP corosync binds to)
Server IP: 10.20.27.52
Version: 2.3.3 (Fedora 20)


corosync.conf:
     totem {
       version: 2
       token: 2000
       token_retransmits_before_loss_const: 10
       vsftype: none
       secauth: off
       transport: udpu
     }

     logging {
       fileline: off
       syslog_facility: local2
       syslog_priority: debug
     }

     quorum {
       provider: corosync_votequorum
     }

     nodelist {
       node {
         nodeid: 1862911301
         ring0_addr: i-a2542ffc
       }
       node {
         nodeid: 2585129852
         ring0_addr: i-cd3b0393
       }
     }


/etc/hosts:
     172.17.0.21    i-cd3b0393
     10.20.50.204 i-a2542ffc


logs:
     Aug 29 02:53:17 i-cd3b0393 local2.info corosync[318]:  [TOTEM ] The consensus timeout expired.
     Aug 29 02:53:17 i-cd3b0393 local2.info corosync[318]:  [TOTEM ] entering GATHER state from 3(The consensus timeout expired.).
     Aug 29 02:53:18 i-cd3b0393 local2.warn corosync[318]:  [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
     Aug 29 02:53:19 i-cd3b0393 local2.warn corosync[318]:  [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
     Aug 29 02:53:21 i-cd3b0393 local2.warn corosync[318]:  [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
     Aug 29 02:53:21 i-cd3b0393 local2.info corosync[318]:  [TOTEM ] The consensus timeout expired.
     Aug 29 02:53:21 i-cd3b0393 local2.info corosync[318]:  [TOTEM ] entering GATHER state from 3(The consensus timeout expired.).
     Aug 29 02:53:22 i-cd3b0393 local2.warn corosync[318]:  [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
     Aug 29 02:53:24 i-cd3b0393 local2.warn corosync[318]:  [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
     Aug 29 02:53:25 i-cd3b0393 local2.warn corosync[318]:  [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
     Aug 29 02:53:26 i-cd3b0393 local2.info corosync[318]:  [TOTEM ] The consensus timeout expired.
     Aug 29 02:53:26 i-cd3b0393 local2.info corosync[318]:  [TOTEM ] entering GATHER state from 3(The consensus timeout expired.).


tcpdump:
     03:03:58.846382 IP 172.17.0.21.57910 > 10.20.50.204.5405: UDP, length 163
     03:03:58.896435 IP 172.17.0.21.57910 > 10.20.50.204.5405: UDP, length 163
     03:03:58.945786 IP 10.20.50.204.37971 > 172.17.0.21.5405: UDP, length 163
     03:03:58.946487 IP 172.17.0.21.57910 > 10.20.50.204.5405: UDP, length 163
     03:03:58.996544 IP 172.17.0.21.57910 > 10.20.50.204.5405: UDP, length 163


corosync-quorumtool:
     Quorum information
     ------------------
     Date:             Fri Aug 29 02:57:45 2014
     Quorum provider:  corosync_votequorum
     Nodes:            1
     Node ID:          2585129852
     Ring ID:          2904
     Quorate:          No

     Votequorum information
     ----------------------
     Expected votes:   2
     Highest expected: 2
     Total votes:      1
     Quorum:           2 Activity blocked
     Flags:

     Membership information
     ----------------------
         Nodeid      Votes Name
     2585129852          1 i-cd3b0393 (local)


========================================

Node 2:
------------
Name: i-a2542ffc
Container IP: 172.17.0.7 (the IP corosync binds to)
Server IP: 10.20.50.204
Version: 2.3.3 (Fedora 20)


corosync.conf:
     totem {
       version: 2
       token: 2000
       token_retransmits_before_loss_const: 10
       vsftype: none
       secauth: off
       transport: udpu
     }

     logging {
       fileline: off
       syslog_facility: local2
       syslog_priority: debug
     }

     quorum {
       provider: corosync_votequorum
     }

     nodelist {
       node {
         nodeid: 1862911301
         ring0_addr: i-a2542ffc
       }
       node {
         nodeid: 2585129852
         ring0_addr: i-cd3b0393
       }
     }


/etc/hosts:
     172.17.0.7    i-a2542ffc
     10.20.27.52 i-cd3b0393


logs:
     Aug 29 02:53:03 i-a2542ffc local2.info corosync[279]:  [TOTEM ] The token was lost in the COMMIT state.
     Aug 29 02:53:03 i-a2542ffc local2.info corosync[279]:  [TOTEM ] entering GATHER state from 4(The token was lost in the COMMIT state.).
     Aug 29 02:53:03 i-a2542ffc local2.info corosync[279]:  [TOTEM ] Creating commit token because I am the rep.
     Aug 29 02:53:03 i-a2542ffc local2.info corosync[279]:  [TOTEM ] Storing new sequence id for ring 1b88
     Aug 29 02:53:03 i-a2542ffc local2.info corosync[279]:  [TOTEM ] entering COMMIT state.
     Aug 29 02:53:05 i-a2542ffc local2.info corosync[279]:  [TOTEM ] The token was lost in the COMMIT state.
     Aug 29 02:53:05 i-a2542ffc local2.info corosync[279]:  [TOTEM ] entering GATHER state from 4(The token was lost in the COMMIT state.).
     Aug 29 02:53:05 i-a2542ffc local2.info corosync[279]:  [TOTEM ] Creating commit token because I am the rep.
     Aug 29 02:53:05 i-a2542ffc local2.info corosync[279]:  [TOTEM ] Storing new sequence id for ring 1b8c
     Aug 29 02:53:05 i-a2542ffc local2.info corosync[279]:  [TOTEM ] entering COMMIT state.
     Aug 29 02:53:07 i-a2542ffc local2.info corosync[279]:  [TOTEM ] The token was lost in the COMMIT state.
     Aug 29 02:53:07 i-a2542ffc local2.info corosync[279]:  [TOTEM ] entering GATHER state from 4(The token was lost in the COMMIT state.).
     Aug 29 02:53:07 i-a2542ffc local2.info corosync[279]:  [TOTEM ] Creating commit token because I am the rep.
     Aug 29 02:53:07 i-a2542ffc local2.info corosync[279]:  [TOTEM ] Storing new sequence id for ring 1b90
     Aug 29 02:53:07 i-a2542ffc local2.info corosync[279]:  [TOTEM ] entering COMMIT state.


tcpdump:
     03:04:25.137038 IP 10.20.27.52.57910 > 172.17.0.7.5405: UDP, length 163
     03:04:25.187086 IP 10.20.27.52.57910 > 172.17.0.7.5405: UDP, length 163
     03:04:25.235829 IP 172.17.0.7.37971 > 10.20.27.52.5405: UDP, length 163
     03:04:25.237123 IP 10.20.27.52.57910 > 172.17.0.7.5405: UDP, length 163
     03:04:25.287847 IP 10.20.27.52.57910 > 172.17.0.7.5405: UDP, length 163


corosync-quorumtool:
     Quorum information
     ------------------
     Date:             Fri Aug 29 02:57:19 2014
     Quorum provider:  corosync_votequorum
     Nodes:            1
     Node ID:          1862911301
     Ring ID:          4488
     Quorate:          No

     Votequorum information
     ----------------------
     Expected votes:   2
     Highest expected: 2
     Total votes:      1
     Quorum:           2 Activity blocked
     Flags:

     Membership information
     ----------------------
         Nodeid      Votes Name
     1862911301          1 i-a2542ffc (local)
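
For reference, the "Quorum: 2 Activity blocked" line on both nodes follows from simple majority arithmetic. The Python sketch below assumes the default majority rule with no special votequorum options (none are set in the configs above); the function names are invented for this example.

```python
def quorum_threshold(expected_votes):
    """Votes needed for a strict majority of expected_votes."""
    return expected_votes // 2 + 1


def is_quorate(total_votes, expected_votes):
    """True when the votes currently present reach the majority threshold."""
    return total_votes >= quorum_threshold(expected_votes)


if __name__ == "__main__":
    # Values from the corosync-quorumtool output: 2 expected votes, 1 present.
    print(quorum_threshold(2))
    print(is_quorate(total_votes=1, expected_votes=2))
```

With 2 expected votes the threshold is 2, so each node on its own, holding 1 vote, stays non-quorate until the two actually form a membership.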




_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss

