Bernhard Bock wrote:
> Pablo Neira Ayuso wrote:
>> I thought that your problem was that you cannot even recover the flows in
>> the first failover, but it seems to me that you have triggered several
>> fail-overs between the nodes. There's no way to hit this in a clean
>> session - ie. empty connection tracking table.
>
> Well, there are several thousand connections established and torn down
> on the primary node before the secondary node takes over, but as far as
> I can tell there is no "bouncing" between the nodes. So, there's no
> empty connection tracking table at failover time:
>
> 1. Stop conntrackd
> 2. Clear conntrack table
> 3. Restart Fedora iptables service (see below)
> 4. Start conntrackd
>    -> 0 connections
> 5. Start traffic
>    -> lots of connections
> 6. fail-over

OK

>> If you are triggering several fail-overs with unclean session, the new
>> script should help. So please, give it a try. It will take you a couple
>> of minutes to get it working.
>
> Your script makes things worse for me, as it drops a lot of traffic on
> switchover.

Hm, the new script does exactly the same thing when the node becomes
primary as script_master.sh used to do, so I cannot find a reason why
the new script would do worse.

> In my setup, it helps a lot to let INVALID packets pass for a couple of
> seconds after switchover and return to the "normal" policy only after
> this time. I coded this into my keepalived scripts. During this time,
> some state recovers and most of the sessions actually work afterwards.

This is a horrible workaround :(

> With a "hard" failover, nearly all sessions get lost.

During the fail-over, keepalived recovers the virtual IPs and conntrackd
commits the states into the kernel. The commit takes very little time,
but you can still lose some packets if the state is not yet present in
the kernel. These packets are then logged as invalid and dropped, as we
don't find any matching state (with a sane stateful rule-set, of
course).

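For reference, here is a minimal sketch of the conntrackd calls that a keepalived notify script typically makes on each transition. It is modelled on the primary-backup.sh example shipped with conntrack-tools, not copied from it, so check it against the script in your version. The function names, the run helper, and the DRY_RUN and CONNTRACKD_BIN variables are made up for illustration, and it assumes conntrackd is in PATH and uses its default configuration file.

```shell
#!/bin/sh
# Illustrative sketch only: names below are hypothetical, not from conntrack-tools.
# Set DRY_RUN=1 to print the commands instead of executing them.

CONNTRACKD_BIN="${CONNTRACKD_BIN:-conntrackd}"   # assumption: binary is in PATH

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Called when keepalived promotes this node to primary.
become_primary() {
    run "$CONNTRACKD_BIN" -c   # commit the external cache into the kernel table
    run "$CONNTRACKD_BIN" -f   # flush the internal and external caches
    run "$CONNTRACKD_BIN" -R   # resync the internal cache with the kernel table
    run "$CONNTRACKD_BIN" -B   # send a bulk update to the backup node
}

# Called when keepalived demotes this node to backup.
become_backup() {
    run "$CONNTRACKD_BIN" -t   # shorten kernel conntrack timers to expire stale entries
    run "$CONNTRACKD_BIN" -n   # request a resync from the primary
}
```

Packets that arrive between the virtual IP takeover and the end of the `-c` commit are the ones that can be seen as invalid. Running `DRY_RUN=1 become_primary` prints the four commands without touching the kernel table.
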
*However*, the TCP sessions should recover, as the peer or the server
retransmits the packet shortly afterwards, so I don't understand why you
lose nearly all the sessions. Is the firewall sending RST packets to the
peer/server to close connections? If so, I remember a similar report
with a RHEL kernel:

http://www.mail-archive.com/netfilter-failover@xxxxxxxxxxxxxxxxxxx/msg00065.html

> One more thing I just noticed: It is not sufficient to clear the
> conntrack table with 'conntrack -F'. I have to unload and reload the
> iptables kernel modules to make it work again. This is done by the
> Fedora init scripts for iptables. Without this, after a "broken"
> fail-over, the machine keeps dropping some (few) packets even without
> conntrackd and a second node involved. After reloading the modules,
> everything's fine again. I guess this hints towards searching in the
> kernel space and not in the conntrack-tools?!

conntrack -F should be enough, so there's something wrong in the kernel.

There are three patches that should hit -stable for 2.6.26 soon. They
are not directly related, but they are worth having:

http://marc.info/?l=netfilter-devel&m=121907870404717&w=2
http://marc.info/?l=netfilter-devel&m=121907870504722&w=2
http://marc.info/?l=netfilter-devel&m=121907870604726&w=2

There were other issues related to NAT, but they are fixed in 2.6.26.
However, I'm not sure whether the Fedora kernel is a real 2.6.26 kernel.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/239215

-- 
"Los honestos son inadaptados sociales" -- Les Luthiers