On 26/08/2013, at 9:36 AM, Andrew Beekhof <andrew@xxxxxxxxxxx> wrote:

> On 23/08/2013, at 8:53 PM, Allan Latham <alatham@xxxxxxxxxxxxxxxx> wrote:
>
>> Hi Honza
>>
>> It may of course be a pacemaker problem. I have no idea where to look
>> for clues -:)
>>
>>> What version of pacemaker are you using? Are you using the cpg-based
>>> one, or the plugin-based one?
>>
>> root@v100 /root # dpkg -l | grep pacemaker
>> ii pacemaker 1.1.7-1 amd64 HA cluster resource manager
>>
>> I have no idea if it's cpg or plugin. We like to stick with whatever
>> comes with the latest Debian unless there is some very good reason not
>> to - simply to keep future maintenance to a minimum.
>
> It depends on your configuration, not necessarily on how pacemaker was built.
> If you have a block like:
>
> service {
>     name: pacemaker
>     ver: ....
> }
>
> Then it's a plugin-based cluster.
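For reference, the distinction Andrew is drawing shows up in the "ver" value of that block. A sketch of what it typically looks like on corosync 1.x, with the ver semantics taken from the Pacemaker 1.1 documentation of that era rather than from this thread (check against your own corosync.conf):

    service {
        # The presence of this block means the pacemaker plugin is
        # loaded, i.e. a plugin-based cluster in Andrew's terms.
        name: pacemaker
        # ver: 0 -> corosync starts and manages the pacemaker daemons itself
        # ver: 1 -> pacemakerd is started separately (e.g. by an init script)
        ver: 0
    }

A cpg-based cluster (e.g. pacemaker on corosync 2.x) has no such block at all.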
>
>> The Debian experimental/unstable version is 1.1.10+ but has 9 pages of
>> dependencies - I really don't want to go there.
>
> That sounds... strange. The upstream dependency list didn't really change in that timeframe.
>
>> Any ideas how I can get pacemaker diagnostics
>
> man crm_report :)
>
> Also: http://blog.clusterlabs.org/blog/2013/debugging-pacemaker/
> and http://blog.clusterlabs.org/blog/2013/pacemaker-logging/
>
>> - the corosync log seems
>> to contain some items which I would think come from pacemaker. Is there
>> anything else to look at? Is there a 'debug' level?
>
> Yes. We read it from corosync (man corosync.conf)
>
>> All the best and many many thanks for all the help.
>>
>> Allan
>>
>> On 23/08/13 11:08, Jan Friesse wrote:
>>> Allan,
>>> sorry for the late reply, but there were more urgent things.
>>>
>>> Anyway,
>>>> I wonder also if 5 servers in a ring with wan links is simply too
>>>> many. Do the possible problems increase by N**2?
>>>
>>> If you were using a 20 sec timeout it should just work. I mean, the token
>>> timeout should be set to the sum of the RTTs between nodes in the ring +
>>> token hold timeout * no_nodes. It really shouldn't be N**2.
>>>
>>> A drop-in replacement is always in the same X.Y series, so for 1.4.x it's
>>> 1.4.6, and for 2.3.x it's 2.3.1. When Y changes, everything works well in
>>> most cases but there may be bigger changes (like 1.3.x -> 1.4.x). When X
>>> changes (so 1.x.y -> 2.x.y) there are HUGE changes. For example, running
>>> 2.3.1 and 1.4.6 on the same net (same broadcast addr, port, ...) will
>>> simply not work.
>>>
>>> Allan Latham wrote:
>>>> Hi Honza
>>>>
>>>> I decided on 1.4.6 as I already had the .deb packages I made for the
>>>> ipv6 tests.
>>>>
>>>> All works well except when I kill the server which is running as DC and
>>>> is also running the active HA services.
>>>>
>>>> The setup is two real servers running Proxmox, each with one KVM virtual
>>>> machine.
>>>>
>>>> Each VM runs corosync 1.4.6, and a virtual server from Hetzner runs the
>>>> same, thus giving me a 3-node corosync/pacemaker cluster. The VS from
>>>> Hetzner never runs any real service - it's just for quorum purposes. The
>>>> two KVMs run IP changeover, drbd and mysql.
>>>>
>>>> I have done many tests with controlled handover - i.e. 'reboot' - and all
>>>> works very well.
>>>>
>>>> Most times when I kill a node this works too - i.e. 'stop' on Proxmox or
>>>> the Hetzner control panel.
>>>>
>>>> I have a long token time (20 seconds) to allow for short network outages,
>>>> so that we don't reconfigure for every little glitch.
>>>>
>>>> Not fully reproducible, but twice now I have killed the server which was
>>>> both DC and had the active services running, and hit a problem. Most
>>>> times it works.
>>>>
>>>> The problem is that the remaining two nodes do not see the killed node
>>>> go offline (crm status shows everything online, with quorum and the DC
>>>> still on the dead node). Nothing works anymore; e.g. 'crm resource
>>>> cleanup xyz' just hangs.
>>>
>>> Are you 100% sure it's a corosync problem? I mean, can't it be a pacemaker
>>> one? What version of pacemaker are you using? Are you using the cpg-based
>>> one, or the plugin-based one (this is known to be very problematic; that's
>>> why Andrew implemented the cpg-based one)?
>>>
>>>> The corosync log however shows the old DC disappearing and the new DC
>>>> being negotiated correctly (to my eyes at least). But this doesn't
>>>> appear to have any effect.
>>>>
>>>> A final part of the bug is that corosync refuses to shut down during the
>>>> reboot process - only a hard reboot works.
>>>
>>> Corosync (1.x) lets applications register a shutdown callback, and if an
>>> application which uses this functionality refuses the shutdown, corosync
>>> will refuse to shut down. Maybe this is the case.
>>>
>>>> This is very similar to problems I've seen on live systems before we
>>>> stopped using wan links. I would love to get a fix for this as it
>>>> completely kills HA. When we are in this state nothing works until the
>>>> ops hard-reboot all nodes.
>>>>
>>>> Question: what exactly do you need from me the next time this happens?
>>>
>>> - Please send me corosync.log from all nodes.
>>> - corosync-objctl -a from all nodes (this will allow us to find out
>>>   what membership corosync sees).
>>> - From nodes which cannot be restarted, the result of corosync-blackbox.
>>>
>>> Regards,
>>> Honza
>>>
>>>> All the best
>>>>
>>>> Allan
>>>>
>>>> On 21/08/13 14:20, Allan Latham wrote:
>>>>> Hi Honza
>>>>>
>>>>> I'd like to compile the latest (and best) version which could work as a
>>>>> drop-in replacement for what is in Debian Wheezy:
>>>>>
>>>>> root@h89 /root # corosync -v
>>>>> Corosync Cluster Engine, version '1.4.2'
>>>>> Copyright (c) 2006-2009 Red Hat, Inc.
>>>>>
>>>>> root@h89 /root # dpkg -l | grep corosync
>>>>> ii corosync 1.4.2-3 amd64 Standards-based cluster framework (daemon and modules)
>>>>>
>>>>> Which version do you recommend?
>>>>>
>>>>> Or is there a compatible .deb somewhere?
>>>>>
>>>>> In particular the bug with pointopoint and udpu is biting me!
>>>>>
>>>>> All the best
>>>>>
>>>>> Allan
>>
>> _______________________________________________
>> discuss mailing list
>> discuss@xxxxxxxxxxxx
>> http://lists.corosync.org/mailman/listinfo/discuss
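Honza's checklist above maps onto a handful of commands. A minimal sketch of what to collect after the next hang (the file names and the crm_report time format are just suggestions; check crm_report's man page, and take the log from wherever 'logfile:' in corosync.conf points):

    # On every node:
    corosync-objctl -a > objctl-$(hostname).txt      # membership as corosync sees it
    cp /var/log/corosync/corosync.log corosync-$(hostname).log

    # Only on nodes where corosync refuses to shut down:
    corosync-blackbox > blackbox-$(hostname).txt

    # Pacemaker-side bundle covering the incident window (see Andrew's links):
    crm_report -f "2013-08-26 09:00"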
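Andrew's answer about a 'debug' level corresponds to the logging section of corosync.conf. A sketch with option names as given in the corosync 1.x man page, from memory - verify against man corosync.conf on your version:

    logging {
        to_logfile: yes
        logfile: /var/log/corosync/corosync.log
        to_syslog: yes
        timestamp: on
        # Verbose output from corosync itself and from anything that logs
        # through it, which includes pacemaker in a setup like Allan's.
        debug: on
    }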
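Finally, Honza's token sizing rule together with Allan's 20-second token would look roughly like this; the RTT and hold figures in the comment are purely illustrative, not measurements from this thread:

    totem {
        version: 2
        # Honza's rule of thumb: token >= sum(RTT between ring neighbours)
        # + token hold timeout * no_nodes. E.g. 5 WAN nodes at ~100 ms RTT
        # each (~500 ms total) plus 5 * ~200 ms hold (~1 s) needs well under
        # 2 s, so a 20 s token leaves a wide safety margin.
        token: 20000    # milliseconds
    }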