Re: Corosync 1.4.6 locks up

Hi Honza

It may of course be a pacemaker problem. I have no idea where to look
for clues :-)

> What version of pacemaker are you using? Are you using the cpg-based
> one, or the plugin-based one?

root@v100 /root # dpkg -l|grep pacemaker
ii  pacemaker 1.1.7-1 amd64 HA cluster resource manager

I have no idea if it's cpg or plugin. We like to stick with whatever
comes with the latest Debian unless there is some very good reason not
to - simply to keep future maintenance to a minimum.
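
(For what it's worth, my understanding - unverified, so treat this as an
assumption - is that the plugin vs. cpg choice shows up in the pacemaker
service stanza in corosync.conf, or in a file under
/etc/corosync/service.d/, along these lines, with purely illustrative
values:

    service {
        # illustrative stanza - Debian may ship it under service.d/
        name: pacemaker
        ver: 0    # 0 = corosync plugin, 1 = cpg-based pacemakerd
    }

so whichever Debian ships is what we are running.)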

The Debian experimental/unstable version is 1.1.10+ but has 9 pages of
dependencies - I really don't want to go there.

Any ideas on how I can get pacemaker diagnostics? The corosync log seems
to contain some items which I would think come from pacemaker. Is there
anything else to look at? Is there a 'debug' level?

All the best and many many thanks for all the help.

Allan


On 23/08/13 11:08, Jan Friesse wrote:
> Allan,
> sorry for late reply, but there were more urgent things.
> 
> Anyway,
>> I wonder also if 5 servers in a ring with wan links is simply too
>> many. Do the possible problems increase by N**2?
> 
> Since you are using a 20 sec timeout it should just work. I mean, the
> token timeout should be set to the sum of RTTs between nodes in the
> ring + token hold timeout * no_nodes. It really shouldn't grow as N**2.
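> 
> As a worked example (all numbers assumed purely for illustration): with
> 5 nodes and ~100 ms rtt per wan hop, the rtt sum around the ring is
> ~500 ms, and with the default 180 ms token hold timeout the rule of
> thumb gives
> 
>     500 + 180 * 5 = 1400 ms
> 
> i.e. growth is linear in the number of nodes, so a 20 sec token timeout
> leaves a very large margin.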
> 
> A drop-in replacement is always in the same X.Y series, so for 1.4.x
> it's 1.4.6, and likewise for 2.3.x it's 2.3.1. When Y changes,
> everything works well in most cases, but there may be bigger changes
> (like 1.3.x -> 1.4.x). When X changes (so 1.x.y -> 2.x.y) there are
> HUGE changes. For example, running 2.3.1 and 1.4.6 on the same net
> (same broadcast addr, port, ...) will simply not work.
> 
> 
> Allan Latham wrote:
>> Hi Honza
>>
>> I decided for 1.4.6 as I already had the .deb packages I made for the
>> ipv6 tests.
>>
>> All works well except when I kill the server which is running as DC
>> and is also running the active HA services.
>>
>> The setup is two real servers running Proxmox, each with one KVM
>> virtual machine.
>>
>> Each VM runs corosync 1.4.6, and a virtual server from Hetzner runs
>> the same, thus giving me a 3-node corosync/pacemaker cluster. The VS
>> from Hetzner never runs any real service - it's just there for quorum
>> purposes. The two KVMs run IP changeover, DRBD and MySQL.
>>
>> I have done many tests with controlled handover - i.e. 'reboot' - all
>> works very well.
>>
>> Most of the time when I kill a node this works too - i.e. 'stop' on
>> Proxmox or the Hetzner control panel.
>>
>> I have a long token time (20 seconds) to allow for short network outages
>> so that we don't reconfigure for every little glitch.
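>>
>> (In corosync.conf terms that is just the token value in the totem
>> stanza; ours is set along these lines, trimmed here to the one
>> directive that matters:
>>
>>     totem {
>>         token: 20000    # token timeout in milliseconds = 20 seconds
>>     }
>> )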
>>
>> It's not fully reproducible, but twice now I have killed the server
>> which was both DC and had the active services running, and hit a
>> problem. Most times it works.
>>
>> The problem is that the remaining two nodes do not see the killed node
>> go offline (crm status shows everything online and the quorum DC on
>> the dead node). Nothing works any more, e.g. 'crm resource cleanup
>> xyz' just hangs.
>>
> 
> Are you 100% sure it's a corosync problem? I mean, couldn't it be a
> pacemaker one? What version of pacemaker are you using? Are you using
> the cpg-based one, or the plugin-based one (this is known to be very
> problematic, which is why Andrew implemented the cpg-based one)?
> 
>> The corosync log however shows the old DC disappearing and the new DC
>> being negotiated correctly (to my eyes at least). But this doesn't
>> appear to have any effect.
>>
>> A final part of the bug is that corosync refuses to shut down during
>> the reboot process - only a hard reboot works.
>>
> 
> Corosync (1.x) provides functionality for an application to register a
> shutdown callback, and if an application which uses this functionality
> refuses the shutdown, corosync will refuse to shut down. Maybe this is
> the case here.
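> 
> Roughly like this - a minimal sketch only, written from memory against
> the 1.x cfg API, so take the exact names with a grain of salt:
> 
> /* Minimal cfg client that answers corosync shutdown requests. */
> #include <stdio.h>
> #include <corosync/corotypes.h>
> #include <corosync/cfg.h>
> 
> /* Corosync calls this when a shutdown is requested; a client that
>  * never answers YES here blocks the shutdown - the symptom you saw. */
> static void shutdown_cb (corosync_cfg_handle_t handle,
>                          corosync_cfg_shutdown_flags_t flags)
> {
>         (void) flags;
>         corosync_cfg_replyto_shutdown (handle,
>                 COROSYNC_CFG_SHUTDOWN_REPLY_YES);
> }
> 
> int main (void)
> {
>         corosync_cfg_callbacks_t callbacks = {
>                 .corosync_cfg_shutdown_callback = shutdown_cb,
>         };
>         corosync_cfg_handle_t handle;
> 
>         if (corosync_cfg_initialize (&handle, &callbacks) != CS_OK) {
>                 fprintf (stderr, "cfg init failed\n");
>                 return 1;
>         }
>         /* Block, waiting for (and answering) shutdown requests. */
>         corosync_cfg_dispatch (handle, CS_DISPATCH_BLOCKING);
>         return 0;
> }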
> 
>> This is very similar to problems I've seen on live systems before we
>> stopped using wan links. I would love to get a fix for this as it
>> completely kills HA. When we are in this state nothing works until the
>> ops hard reboot all nodes.
>>
>> Question: what exactly do you need from me the next time this happens?
>>
> 
> - Please send me corosync.log from all nodes.
> - The output of 'corosync-objctl -a' from all nodes (this will allow us
> to find out what membership corosync sees).
> - From the nodes which cannot be restarted, the result of
> corosync-blackbox.
> 
> Regards,
>   Honza
> 
>> All the best
>>
>> Allan
>>
>>
>> On 21/08/13 14:20, Allan Latham wrote:
>>> Hi Honza
>>>
>>> I'd like to compile the latest (and best) version which could work as
>>> a drop-in replacement for what is in Debian Wheezy:
>>>
>>> root@h89 /root # corosync -v
>>> Corosync Cluster Engine, version '1.4.2'
>>> Copyright (c) 2006-2009 Red Hat, Inc.
>>>
>>> root@h89 /root # dpkg -l |grep corosync
>>> ii  corosync                           1.4.2-3
>>> amd64        Standards-based cluster framework (daemon and modules)
>>>
>>> Which version do you recommend?
>>>
>>> Or is there a compatible .deb somewhere?
>>>
>>> In particular the bug with pointopoint and udpu is biting me!
>>>
>>> All the best
>>>
>>> Allan
>>>
>>>
>>
>>
> 
> 

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss



