Re: [PATCH] Add a troubleshooting guide to corosync.conf.5

Dmitry Koterov <dmitry.koterov@xxxxxxxxx> · Wed, 14 Jan 2015 19:18:49 +0300

> Please also add a note that one should specify IP addresses in ringX_addr
> directives, not a domain name. Else corosync does not work properly in UDPu
> mode, and at the same time it does not say anything significant in its log
> files. I've spent 4 hours recently trying to figure this out.
>

As I replied to you on the PCMK list, ringX_addr resolution should work as
expected (it is the only configuration I use, and the same applies to most
clusters created by pcs). Even if ringX_addr resolution were broken, that
would certainly not belong under "TROUBLESHOOTING"; it would be a bug to fix.

Can you please attach the corosync logs (ideally with debug enabled), so that
we can find the root cause of the problem you are hitting?

Sure, here they are:
http://oss.clusterlabs.org/pipermail/pacemaker/2015-January/023320.html

The complete NON-WORKING corosync.conf is below (note that where it says "a.b.c.d", I have a plain IP address):

# THIS IS A NON-WORKING CONFIGURATION DUE TO non-IP addresses in ringX_addr!
totem {
    version: 2
    cluster_name: velvica
    secauth: on
    clear_node_high_bit: yes
    interface {
        ringnumber: 0
        bindnetaddr: a.b.c.d
        mcastport: 5405
        ttl: 1
    }
    transport: udpu
    heartbeat_failures_allowed: 3
}
logging {
    fileline: off
    to_logfile: no
    to_syslog: yes
    debug: off
    timestamp: off
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}
nodelist {
  node {
    ring0_addr: node1  # <-- seems not working, IP address is needed
  }
  node {
    ring0_addr: node2
  }
  node {
    ring0_addr: node3
  }
}
quorum {
    provider: corosync_votequorum
}

If I then replace node1, node2, and node3 with their IP addresses, everything starts working. See the /var/log/syslog output at http://oss.clusterlabs.org/pipermail/pacemaker/2015-January/023320.html
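
Since name-based ringX_addr values are reported to work in other setups, it may help to check what each name actually resolves to on every node. The following is a minimal Python sketch, not part of corosync. The helper names (ring_addrs, resolve, is_loopback, report) are hypothetical, and the loopback check reflects an assumption (a node name mapped to a 127.x address, for example in /etc/hosts) rather than a confirmed cause of the problem described here.

```python
import ipaddress
import re
import socket


def ring_addrs(conf_text):
    """Return the values of all ringX_addr directives, in file order."""
    pattern = re.compile(r"^\s*ring\d+_addr:\s*(\S+)", re.MULTILINE)
    return pattern.findall(conf_text)


def resolve(name):
    """Return the sorted, de-duplicated addresses the local resolver gives."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return []  # the name does not resolve on this host
    return sorted({info[4][0] for info in infos})


def is_loopback(addr):
    """True if addr is a loopback address (127.0.0.0/8 or ::1)."""
    # Drop any IPv6 scope suffix such as "%eth0" before parsing.
    return ipaddress.ip_address(addr.split("%")[0]).is_loopback


def report(conf_text):
    """Return one line per ringX_addr value describing how it resolves here."""
    lines = []
    for name in ring_addrs(conf_text):
        addrs = resolve(name)
        if not addrs:
            lines.append(f"{name}: does not resolve")
        elif any(is_loopback(a) for a in addrs):
            lines.append(f"{name}: resolves to loopback {addrs}")
        else:
            lines.append(f"{name}: {addrs}")
    return lines
```

Running report() on the text of /etc/corosync/corosync.conf on each node, and comparing the results across nodes, would show whether node1, node2, and node3 map to the same non-loopback addresses everywhere.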

> On Monday, January 5, 2015, Jan Pokorný <jpokorny@xxxxxxxxxx> wrote:
>
>> (if you let me, some more in-line)
>>
>> On 05/01/15 16:20 +0000, Christine Caulfield wrote:
>>> Looks good to me, thanks. I've fixed a few typos and pointed out a spurious
>>> capital inline below
>>>
>>> On 05/01/15 14:39, Steven Dake wrote:
>>>> Add a troubleshooting guide.  I'm sure other folks have some good stuff
>>>> to put in here.  These are just the ones I know about :)
>>>>
>>>> Signed-off-by: Steven Dake <sdake@xxxxxxxxxx>
>>>> ---
>>>>  man/corosync.conf.5 | 39 +++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 39 insertions(+)
>>>>
>>>> diff --git a/man/corosync.conf.5 b/man/corosync.conf.5
>>>> index 8e774c1..16d84ca 100644
>>>> --- a/man/corosync.conf.5
>>>> +++ b/man/corosync.conf.5
>>>> @@ -678,6 +678,45 @@ Native means one of shm or socket, depending on what is supported by OS. On syst
>>>>  with support for both, SHM is selected. SHM is generally faster, but need to allocate
>>>>  ring buffer file in /dev/shm.
>>>>
>>>> +.SH "TROUBLESHOOTING"
>>>> +.TP
>>>> +Ocassionally Corosync will not work with the default network.  Here are some
>>     ^^^ Occasionally
>>
>>>> +common tips that people have used to find a working Corosync.
>>>> +
>>>> +.TP
>>>> +Disable the firewall.  The firwall could block Corosync packets from reaching
>>>                             ^^firewall
>>>> +the network.
>>>> +
>>>> +.TP
>>>> +Force IGMP v2.  Some modern switches do not support the kernel IGMP v3
>>>> + protocol.  As a result, They will not properly register the cluster.  To do
>>                              ^^^ they
>>
>>>> +this, simply run the command
>>>> +
>>>> +.BR sysctl -w net.ipv4.conf.all.force_igmp_version=2
>>>> +
>>>> +.TP
>>>> +If on a routed network, set a larger ttl.  The TTL tells the routers how long
>>>> +to let the packet multicast before dropping it permanently.  The Default ttl
>>>                                                              ^^^ default
>>
>> (inconsistent casing of ttl/TTL)
>>
>>>> +is set to 1, which means the packet will drop after its first hop.  This will
>>>> +not work well on a routed network.
>>>> +
>>>> +.TP
>>>> +I use a VLAN and Corosync doesn't work.  If your using a VLAN, VLAN's shave the
>>>                                            ^^^ you're             VLANs
>>>
>>>> +packet size available for Corosync to use in some cases. Corosync does not
>>>> +automatically adjust to this change.  Set netmtu appropriately when using a
>>>> +VLAN.
>>>> +
>>>> +.TP
>>>> +If all else fails, use UDPU.  The authors implemented UDPU to solve the various
>>>> +problems with multicast that plague modern switch implementations.  The UDPU
>>>> +protocol was initially believed to be much slower but the reality after
>>>> +implementation is that it doesn't make much difference.
>>>> +
>>>> +Even with UDPU you would be hard pressed to find a faster group messaging
>>>> +system than Corosync.  The only downside of UDPU is it results in much more
>>>> +packet copying across the network.
>>>> +
>>>> +
>>>>  .SH "FILES"
>>>>  .TP
>>>>  /etc/corosync/corosync.conf
>>
>> --
>> Jan
>>

>
> _______________________________________________
> discuss mailing list
> discuss@xxxxxxxxxxxx
> http://lists.corosync.org/mailman/listinfo/discuss
>
