Hi Bernhard,

Bernhard Bock wrote:
> Pablo Neira Ayuso wrote:
>>>> Is the firewall sending RST packets to the peer/server to close
>>>> connections? If so, I remember a similar report with a RHEL kernel:
>
> I do not see any RST packets, neither on server nor on client side.

Fine.

> I have done more tests this morning. Unfortunately, things are
> complicated:
>
> I repeated a basic failover test lots of times while making 1.000.000
> connections. This test with 1000 parallel connections breaks every
> time. 500 is OK every time.

In my testbed [1], with a vanilla Linux kernel 2.6.26.3 with no
patching at all, using the current conntrack-tools git snapshot on
Debian, and using `ab' (the Apache benchmark tool) to generate
1.000.000 connections with 1000 parallel connections - with
log_invalid set off - I don't see any packet hitting the log-invalid
rule.

I noticed that ab reports ~3000 connection failures when triggering a
couple of fail-overs (to do so, my script sets one of the links down
and brings it up again 5 seconds later). However, the number of
failures is similar without triggering the fail-over: ~1500. I guess
that the connections are timing out after several retransmissions,
but I notice nothing abnormal since the packets don't hit the
log-invalid rule.

Two comments about my tests:

* I had to raise the default values of SocketBufferSize and
SocketBufferSizeMaxGrowth in conntrackd.conf to avoid netlink
overflows with such an amount of traffic. There are log messages in
conntrackd.log that warn about this issue. You can also notice it
when conntrackd hits 100% CPU consumption at some point - this
happens when netlink overflows.

* I also had to raise the default values of McastSndSocketBuffer and
McastRcvSocketBuffer, since I was noticing lost packets via
`conntrackd -s' - see the multicast sequence tracking statistics.
This happens when the dedicated link gets congested by the
state-synchronization traffic.

With these tweaks the results were good: conntrackd was consuming
about the same percentage of CPU as ksoftirqd (~25% each according to
top, which is not very reliable but it's OK for an estimation). It
would be possible to reduce this CPU consumption even more by means
of the Filter clause, e.g. by replicating only TCP states in
ESTABLISHED. These tweaks affect the behaviour of conntrackd, so they
are worth a try - see the conntrackd.conf sketch at the end of this
mail.

> The kernel only has problems with 1000 connections, and then only
> from time to time. In most of the cases (I guess ca. 80% of all
> tests), I do not need to unload/load the kernel modules, but only
> clear the conntrack table to get it back up running. The other times
> I have to reload the kernel modules in order to make the system work
> again. I cannot see any pattern there.

Neither can I, and I cannot reproduce the problems that you're
reporting. More questions to try to diagnose your problem:

1) Does /var/log/conntrackd.log - or syslog - tell you anything
relevant? Are the entries being committed to kernel-space
successfully?

2) Can you see the committed entries in the kernel via `conntrack -L'
after the fail-over?

3) Are you noticing any abnormal CPU consumption?

[1] http://conntrack-tools.netfilter.org/testcase.html

--
"Honest people are social misfits" -- Les Luthiers
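
P.S.: In case it is useful, here is a minimal sketch of the
conntrackd.conf tweaks described above. The option names follow the
example configuration shipped with conntrack-tools, but the values
are illustrative only and the Filter syntax may differ between
versions, so double-check against the example config in your tarball.
Unrelated mandatory options are omitted.

    General {
        # Raise the netlink event socket buffer to avoid overflows
        # under heavy traffic; overflows show up as warnings in
        # conntrackd.log and as conntrackd spinning at 100% CPU.
        # Values are illustrative - tune them to your traffic.
        SocketBufferSize 2097152
        SocketBufferSizeMaxGrowth 8388608

        # Replicate only what you need, e.g. long-lived TCP states,
        # to cut CPU consumption and synchronization traffic.
        Filter {
            Protocol Accept {
                TCP
            }
            State Accept {
                ESTABLISHED for TCP
            }
        }
    }

    Sync {
        Multicast {
            # Raise these if `conntrackd -s' shows lost messages in
            # the multicast sequence tracking statistics, i.e. the
            # dedicated link is congested by synchronization traffic.
            McastSndSocketBuffer 1249280
            McastRcvSocketBuffer 1249280
        }
    }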