On 26/08/2013, at 9:36 AM, Andrew Beekhof <andrew@xxxxxxxxxxx> wrote:

> On 23/08/2013, at 8:53 PM, Allan Latham <alatham@xxxxxxxxxxxxxxxx> wrote:
>
>> Hi Honza
>>
>> It may of course be a pacemaker problem. I have no idea where to look
>> for clues -:)
>>
>>> What version of pacemaker are you using? Are you using the cpg-based
>>> one, or the plugin-based one?
>>
>> root@v100 /root # dpkg -l | grep pacemaker
>> ii pacemaker 1.1.7-1 amd64 HA cluster resource manager
>>
>> I have no idea if it's cpg or plugin. We like to stick with whatever
>> comes with the latest Debian unless there is some very good reason not
>> to - simply to keep future maintenance to a minimum.
>
> It depends on your configuration, not necessarily on how pacemaker was built.
> If you have a block like:
>
> service {
>     name: pacemaker
>     ver: ....
> }
>
> Then it's a plugin-based cluster.
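For reference, the distinction Andrew is drawing shows up in the "ver" value of that block. A sketch of what it typically looks like on corosync 1.x, with the ver semantics taken from the Pacemaker 1.1 documentation of that era rather than from this thread (check against your own corosync.conf):

    service {
        # The presence of this block means the pacemaker plugin is
        # loaded, i.e. a plugin-based cluster in Andrew's terms.
        name: pacemaker
        # ver: 0 -> corosync starts and manages the pacemaker daemons itself
        # ver: 1 -> pacemakerd is started separately (e.g. by an init script)
        ver: 0
    }

A cpg-based cluster (e.g. pacemaker on corosync 2.x) has no such block at all.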
>
>> The Debian experimental/unstable version is 1.1.10+ but has 9 pages of
>> dependencies - I really don't want to go there.
>
> That sounds... strange. The upstream dependency list didn't really change in that timeframe.
>
>> Any ideas how I can get pacemaker diagnostics
>
> man crm_report :)
>
> Also: http://blog.clusterlabs.org/blog/2013/debugging-pacemaker/
> and http://blog.clusterlabs.org/blog/2013/pacemaker-logging/
>
>> - the corosync log seems
>> to contain some items which I would think come from pacemaker. Is there
>> anything else to look at? Is there a 'debug' level?
>
> Yes. We read it from corosync (man corosync.conf)
>
>> All the best and many many thanks for all the help.
>>
>> Allan
>>
>> On 23/08/13 11:08, Jan Friesse wrote:
>>> Allan,
>>> sorry for the late reply, but there were more urgent things.
>>>
>>> Anyway,
>>>> I wonder also if 5 servers in a ring with wan links is simply too
>>>> many. Do the possible problems increase by N**2?
>>>
>>> If you were using a 20 sec timeout it should just work. I mean, the token
>>> timeout should be set to the sum of the RTTs between nodes in the ring +
>>> token hold timeout * no_nodes. It really shouldn't be N**2.
>>>
>>> A drop-in replacement is always in the same X.Y series, so for 1.4.x it's
>>> 1.4.6, and for 2.3.x it's 2.3.1. When Y changes, everything works well in
>>> most cases but there may be bigger changes (like 1.3.x -> 1.4.x). When X
>>> changes (so 1.x.y -> 2.x.y) there are HUGE changes. For example, running
>>> 2.3.1 and 1.4.6 on the same net (same broadcast addr, port, ...) will
>>> simply not work.
>>>
>>> Allan Latham wrote:
>>>> Hi Honza
>>>>
>>>> I decided on 1.4.6 as I already had the .deb packages I made for the
>>>> ipv6 tests.
>>>>
>>>> All works well except when I kill the server which is running as DC and
>>>> is also running the active HA services.
>>>>
>>>> The setup is two real servers running Proxmox, each with one KVM virtual
>>>> machine.
>>>>
>>>> Each VM runs corosync 1.4.6, and a virtual server from Hetzner runs the
>>>> same, thus giving me a 3-node corosync/pacemaker cluster. The VS from
>>>> Hetzner never runs any real service - it's just for quorum purposes. The
>>>> two KVMs run IP changeover, drbd and mysql.
>>>>
>>>> I have done many tests with controlled handover - i.e. 'reboot' - and all
>>>> works very well.
>>>>
>>>> Most times when I kill a node this works too - i.e. 'stop' on Proxmox or
>>>> the Hetzner control panel.
>>>>
>>>> I have a long token time (20 seconds) to allow for short network outages,
>>>> so that we don't reconfigure for every little glitch.
>>>>
>>>> Not fully reproducible, but twice now I have killed the server which was
>>>> both DC and had the active services running, and hit a problem. Most
>>>> times it works.
>>>>
>>>> The problem is that the remaining two nodes do not see the killed node
>>>> go offline (crm status shows everything online, with quorum and the DC
>>>> still on the dead node). Nothing works anymore; e.g. 'crm resource
>>>> cleanup xyz' just hangs.
>>>
>>> Are you 100% sure it's a corosync problem? I mean, can't it be a pacemaker
>>> one? What version of pacemaker are you using? Are you using the cpg-based
>>> one, or the plugin-based one (this is known to be very problematic; that's
>>> why Andrew implemented the cpg-based one)?
>>>
>>>> The corosync log however shows the old DC disappearing and the new DC
>>>> being negotiated correctly (to my eyes at least). But this doesn't
>>>> appear to have any effect.
>>>>
>>>> A final part of the bug is that corosync refuses to shut down during the
>>>> reboot process - only a hard reboot works.
>>>
>>> Corosync (1.x) lets applications register a shutdown callback, and if an
>>> application which uses this functionality refuses the shutdown, corosync
>>> will refuse to shut down. Maybe this is the case.
>>>
>>>> This is very similar to problems I've seen on live systems before we
>>>> stopped using wan links. I would love to get a fix for this as it
>>>> completely kills HA. When we are in this state nothing works until the
>>>> ops hard-reboot all nodes.
>>>>
>>>> Question: what exactly do you need from me the next time this happens?
>>>
>>> - Please send me corosync.log from all nodes.
>>> - corosync-objctl -a from all nodes (this will allow us to find out
>>>   what membership corosync sees).
>>> - From nodes which cannot be restarted, the result of corosync-blackbox.
>>>
>>> Regards,
>>> Honza
>>>
>>>> All the best
>>>>
>>>> Allan
>>>>
>>>> On 21/08/13 14:20, Allan Latham wrote:
>>>>> Hi Honza
>>>>>
>>>>> I'd like to compile the latest (and best) version which could work as a
>>>>> drop-in replacement for what is in Debian Wheezy:
>>>>>
>>>>> root@h89 /root # corosync -v
>>>>> Corosync Cluster Engine, version '1.4.2'
>>>>> Copyright (c) 2006-2009 Red Hat, Inc.
>>>>>
>>>>> root@h89 /root # dpkg -l | grep corosync
>>>>> ii corosync 1.4.2-3 amd64 Standards-based cluster framework (daemon and modules)
>>>>>
>>>>> Which version do you recommend?
>>>>>
>>>>> Or is there a compatible .deb somewhere?
>>>>>
>>>>> In particular the bug with pointopoint and udpu is biting me!
>>>>>
>>>>> All the best
>>>>>
>>>>> Allan
>>
>> _______________________________________________
>> discuss mailing list
>> discuss@xxxxxxxxxxxx
>> http://lists.corosync.org/mailman/listinfo/discuss
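Honza's checklist above maps onto a handful of commands. A minimal sketch of what to collect after the next hang (the file names and the crm_report time format are just suggestions; check crm_report's man page, and take the log from wherever 'logfile:' in corosync.conf points):

    # On every node:
    corosync-objctl -a > objctl-$(hostname).txt      # membership as corosync sees it
    cp /var/log/corosync/corosync.log corosync-$(hostname).log

    # Only on nodes where corosync refuses to shut down:
    corosync-blackbox > blackbox-$(hostname).txt

    # Pacemaker-side bundle covering the incident window (see Andrew's links):
    crm_report -f "2013-08-26 09:00"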
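Andrew's answer about a 'debug' level corresponds to the logging section of corosync.conf. A sketch with option names as given in the corosync 1.x man page, from memory - verify against man corosync.conf on your version:

    logging {
        to_logfile: yes
        logfile: /var/log/corosync/corosync.log
        to_syslog: yes
        timestamp: on
        # Verbose output from corosync itself and from anything that logs
        # through it, which includes pacemaker in a setup like Allan's.
        debug: on
    }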
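Finally, Honza's token sizing rule together with Allan's 20-second token would look roughly like this; the RTT and hold figures in the comment are purely illustrative, not measurements from this thread:

    totem {
        version: 2
        # Honza's rule of thumb: token >= sum(RTT between ring neighbours)
        # + token hold timeout * no_nodes. E.g. 5 WAN nodes at ~100 ms RTT
        # each (~500 ms total) plus 5 * ~200 ms hold (~1 s) needs well under
        # 2 s, so a 20 s token leaves a wide safety margin.
        token: 20000    # milliseconds
    }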