Hi Honza

It may of course be a pacemaker problem. I have no idea where to look for clues :-)

> What version of pacemaker are you using? Are you using the cpg based one, or the plugin based one?

root@v100 /root # dpkg -l|grep pacemaker
ii  pacemaker  1.1.7-1  amd64  HA cluster resource manager

I have no idea if it's cpg or plugin. We like to stick with whatever comes with the latest Debian unless there is some very good reason not to - simply to keep future maintenance to a minimum. The Debian experimental/unstable version is 1.1.10+ but has 9 pages of dependencies - I really don't want to go there.

Any ideas how I can get pacemaker diagnostics? The corosync log seems to contain some items which I would think come from pacemaker. Is there anything else to look at? Is there a 'debug' level?

All the best and many many thanks for all the help.

Allan

On 23/08/13 11:08, Jan Friesse wrote:
> Allan,
> sorry for the late reply, but there were more urgent things.
>
> Anyway,
>> I wonder also if 5 servers in a ring with wan links is simply too many. Do the possible problems increase by N**2?
>
> If you were using a 20 sec timeout it should just work. I mean, the token timeout should be set to the sum of the rtt between nodes in the ring + token hold timeout * no_nodes. It really shouldn't be N**2.
>
> A drop-in replacement is always in the same X.Y series, so for 1.4.x it's 1.4.6, and for 2.3.x it's 2.3.1. When Y changes, everything works well in most cases but there may be a bigger change (like 1.3.x -> 1.4.x). When X changes (so 1.x.y -> 2.x.y) there are HUGE changes. For example, running 2.3.1 and 1.4.6 on the same net (same broadcast addr, port, ...) will simply not work.
>
>
> Allan Latham wrote:
>> Hi Honza
>>
>> I decided on 1.4.6 as I already had the .deb packages I made for the ipv6 tests.
>>
>> All works well except when I kill the server which is running as DC and is also running the active HA services.
>>
>> The setup is two real servers running proxmox, each with one KVM virtual machine.
>>
>> Each VM runs corosync 1.4.6, and a virtual server from Hetzner runs the same, giving me a 3 node corosync/pacemaker cluster. The VS from Hetzner never runs any real service - it's just there for quorum purposes. The two KVMs run IP changeover, drbd and mysql.
>>
>> I have done many tests with controlled handover - i.e. 'reboot' - and it all works very well.
>>
>> Most times when I kill a node this works too - i.e. 'stop' on proxmox or the Hetzner control panel.
>>
>> I have a long token time (20 seconds) to allow for short network outages, so that we don't reconfigure for every little glitch.
>>
>> It is not fully reproducible, but twice now I have killed the server which was both DC and running the active services, and hit a problem. Most times it works.
>>
>> The problem is that the remaining two nodes do not see the killed node go offline (crm status shows everything online and the quorum DC on the dead node). Nothing works any more, e.g. crm resource cleanup xyz just hangs.
>>
>
> Are you 100% sure it's a corosync problem? I mean, can't it be a pacemaker one? What version of pacemaker are you using? Are you using the cpg based one, or the plugin based one (this is known to be very problematic, that's why Andrew implemented the cpg based one)?
>
>> The corosync log however shows the old DC disappearing and the new DC being negotiated correctly (to my eyes at least). But this doesn't appear to have any effect.
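
Breaking into the quote for a moment to put numbers on the 20 second token mentioned above - this is the totem fragment I am running on all three nodes. The figures in the comments are rough guesses at Honza's rule of thumb (sum of the RTTs around the ring plus the token hold timeout times the number of nodes), not measured values, so please correct me if the arithmetic is off:

totem {
        version: 2
        # rule of thumb: token >= sum of RTTs + hold timeout * no_nodes
        # my guesses: ~50ms per WAN hop x 3 hops   = ~150ms
        #             ~200ms hold       x 3 nodes  = ~600ms
        # so something around 1 second would already be safe in theory;
        # 20 seconds is deliberately generous to ride out short WAN outages
        token: 20000
}

Even if those guesses are well off, 20 seconds should leave a lot of headroom.
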
>>
>> A final part of the bug is that corosync refuses to shut down during the reboot process - only a hard reboot works.
>>
>
> Corosync (1.x) provides functionality to register a shutdown callback, and if an application which uses this functionality refuses the shutdown, corosync will refuse to shut down. Maybe this is the case.
>
>> This is very similar to problems I've seen on live systems before we stopped using wan links. I would love to get a fix for this as it completely kills HA. When we are in this state nothing works until the ops hard reboot all nodes.
>>
>> Question: what exactly do you need from me the next time this happens?
>>
>
> - Please send me corosync.log from all nodes
> - corosync-objctl -a from all nodes (this will allow us to find out what membership corosync sees)
> - From the nodes which cannot be restarted, the result of corosync-blackbox
>
> Regards,
>   Honza
>
>> All the best
>>
>> Allan
>>
>>
>> On 21/08/13 14:20, Allan Latham wrote:
>>> Hi Honza
>>>
>>> I'd like to compile the latest (and best) version which could work as a drop-in replacement for what is in Debian Wheezy:
>>>
>>> root@h89 /root # corosync -v
>>> Corosync Cluster Engine, version '1.4.2'
>>> Copyright (c) 2006-2009 Red Hat, Inc.
>>>
>>> root@h89 /root # dpkg -l |grep corosync
>>> ii  corosync  1.4.2-3  amd64  Standards-based cluster framework (daemon and modules)
>>>
>>> Which version do you recommend?
>>>
>>> Or is there a compatible .deb somewhere?
>>>
>>> In particular the bug with pointopoint and udpu is biting me!
>>>
>>> All the best
>>>
>>> Allan
>>
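
P.S. So that nothing gets missed the next time this happens, I have put your list into a small script to run as root on every node. It assumes the log ends up in /var/log/corosync/corosync.log (that is where mine goes) and that corosync-objctl and corosync-blackbox from the Debian packages are on the path - shout if anything else would be useful:

#!/bin/sh
# gather the debug info Honza asked for, on one node
set -u
HOST=$(hostname)
OUT=/tmp/corosync-debug-$HOST
mkdir -p "$OUT"

# 1. the corosync log (adjust the path if logging.logfile points elsewhere)
cp /var/log/corosync/corosync.log "$OUT/" 2>/dev/null

# 2. full object database dump - shows the membership corosync sees
corosync-objctl -a > "$OUT/corosync-objctl-a.txt" 2>&1

# 3. blackbox dump - mainly from the nodes which refuse to restart
corosync-blackbox > "$OUT/corosync-blackbox.txt" 2>&1

tar czf "/tmp/corosync-debug-$HOST.tar.gz" -C /tmp "corosync-debug-$HOST"
echo "wrote /tmp/corosync-debug-$HOST.tar.gz"
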