Re: Corosync 1.4.6 locks up

Allan,
sorry for the late reply; there were more urgent things to deal with.

Anyway,
> I wonder also if 5 servers in a ring with wan links is simply too
> many. Do the possible problems increase by N**2?

With a 20 second token timeout it should just work. The token timeout
should be set to the sum of the RTTs between the nodes in the ring +
token hold timeout * no_nodes. That grows linearly with the number of
nodes; it really shouldn't be N**2.
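
As a rough sketch of that rule, here is the arithmetic in Python. The RTT figures and the token hold value are made-up illustrations, not measurements from your cluster, and the function name is hypothetical:

```python
def estimate_token_timeout_ms(rtts_ms, token_hold_ms):
    """Sum of the RTTs around the ring plus the token hold timeout
    multiplied by the number of nodes (all values in milliseconds)."""
    no_nodes = len(rtts_ms)
    return sum(rtts_ms) + token_hold_ms * no_nodes

# Hypothetical 5-node ring over WAN links: one RTT per hop, in ms.
rtts_ms = [40, 35, 60, 45, 50]
token_hold_ms = 180  # assumed value; check what your configuration uses

needed = estimate_token_timeout_ms(rtts_ms, token_hold_ms)
print(needed)  # 230 + 900 = 1130
```

With numbers of this size the result stays far below 20000 ms, and adding a node adds one RTT and one token hold period rather than multiplying anything.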

A drop-in replacement is always within the same X.Y series, so for 1.4.x
it's 1.4.6, and for 2.3.x it's 2.3.1. When Y changes, everything works
well in most cases, but there may be bigger changes (like 1.3.x -> 1.4.x).
When X changes (1.x.y -> 2.x.y) there are HUGE changes. For example,
running 2.3.1 and 1.4.6 on the same network (same broadcast address,
port, ...) will simply not work.
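
The versioning rule above can be written down as a small check. This is only an illustration of the X.Y.Z convention described here; the function name and the three labels are hypothetical, not anything corosync provides:

```python
def upgrade_kind(old, new):
    """Classify an upgrade between two X.Y.Z version strings."""
    old_x, old_y = old.split(".")[:2]
    new_x, new_y = new.split(".")[:2]
    if old_x != new_x:
        # X changed: huge changes, the two cannot share a network
        return "incompatible"
    if old_y != new_y:
        # Y changed: works in most cases, bigger changes are possible
        return "mostly-compatible"
    # Same X.Y series
    return "drop-in"

print(upgrade_kind("1.4.2", "1.4.6"))  # drop-in
print(upgrade_kind("1.3.0", "1.4.6"))  # mostly-compatible
print(upgrade_kind("1.4.6", "2.3.1"))  # incompatible
```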


Allan Latham wrote:
> Hi Honza
> 
> I decided for 1.4.6 as I already had the .deb packages I made for the
> ipv6 tests.
> 
> All works well except when I kill the server running as DC which is
> also running the active HA servers.
> 
> The setup is two real servers running proxmox each with one KVM virtual
> machine.
> 
> Each VM runs corosync 1.4.6 and a virtual server from Hetzner runs the
> same thus giving me a 3 node corosync/pacemaker cluster. The VS from
> Hetzner never runs any real service - it's just for quorum purposes. The
> two KVMs run ip changeover, drbd and mysql.
> 
> I have done many tests with controlled handover - i.e. 'reboot' - all
> works very well.
> 
> Most times when I kill a node this works too - i.e. 'stop' on proxmox or
> the Hetzner control panel.
> 
> I have a long token time (20 seconds) to allow for short network outages
> so that we don't reconfigure for every little glitch.
> 
> Not fully reproducible but twice now I have killed the server which was
> both DC and had the active services running and hit a problem. Most
> times it works.
> 
> The problem is that the remaining two nodes do not see the killed node
> go offline (crm status shows everything online and quorum DC on the dead
> node). Nothing works anymore e.g. crm resource cleanup xyz just hangs.
> 

Are you 100% sure it's a corosync problem? I mean, couldn't it be a
pacemaker one? What version of pacemaker are you using? Are you using the
cpg-based one or the plugin-based one (the latter is known to be very
problematic, which is why Andrew implemented the cpg-based one)?

> The corosync log however shows the old DC disappearing and the new DC
> being negotiated correctly (to my eyes at least). But this doesn't
> appear to have any effect.
> 
> A final part of the bug is that corosync refuses to shut down during the
> reboot process - only a hard reboot works.
> 

Corosync (1.x) lets an application register a shutdown callback, and if
an application using this functionality refuses the shutdown, corosync
will refuse to shut down as well. Maybe this is the case here.

> This is very similar to problems I've seen on live systems before we
> stopped using wan links. I would love to get a fix for this as it
> completely kills HA. When we are in this state nothing works until the
> ops hard reboot all nodes.
> 
> Question: what exactly do you need from me the next time this happens?
> 

Please send me:
- corosync.log from all nodes
- output of corosync-objctl -a from all nodes (this will allow us to find
  out what membership corosync sees)
- output of corosync-blackbox from the nodes which cannot be restarted
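
If it helps, the two commands can be captured on each node with something like the sketch below. The output file names, the 60 second limit and the helper name are assumptions of mine; corosync.log still has to be copied by hand, because its location depends on your logging configuration:

```python
import os
import shutil
import subprocess

# The two commands named above; the output file names are arbitrary.
COMMANDS = {
    "objctl.txt": ["corosync-objctl", "-a"],
    "blackbox.txt": ["corosync-blackbox"],
}

def collect(outdir="."):
    """Run each command found on PATH, save its output, return files written."""
    written = []
    for filename, cmd in COMMANDS.items():
        if shutil.which(cmd[0]) is None:
            continue  # tool not installed on this node, skip it
        try:
            result = subprocess.run(cmd, capture_output=True, text=True,
                                    timeout=60)
        except subprocess.TimeoutExpired:
            continue  # command hung, nothing to save
        path = os.path.join(outdir, filename)
        with open(path, "w") as f:
            f.write(result.stdout + result.stderr)
        written.append(path)
    return written

if __name__ == "__main__":
    print(collect())
```

Run it as root on every node, ideally while the cluster is still in the broken state, and attach the resulting files.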

Regards,
  Honza

> All the best
> 
> Allan
> 
> 
> On 21/08/13 14:20, Allan Latham wrote:
>> Hi Honza
>>
>> I'd like to compile the latest (and best) version which could work as a
>> drop in replacement for what is in Debian Wheezy:
>>
>> root@h89 /root # corosync -v
>> Corosync Cluster Engine, version '1.4.2'
>> Copyright (c) 2006-2009 Red Hat, Inc.
>>
>> root@h89 /root # dpkg -l |grep corosync
>> ii  corosync  1.4.2-3  amd64  Standards-based cluster framework (daemon and modules)
>>
>> Which version do you recommend?
>>
>> Or is there a compatible .deb somewhere?
>>
>> In particular the bug with pointopoint and udpu is biting me!
>>
>> All the best
>>
>> Allan
>>
>>
> 
> 
