Hello,
I have recently set up a pair of Dell PowerEdge R610 servers (Xeon
X5650, 8GB RAM) for active-backup firewall duty. I've installed
conntrack-tools-1.0.1 and libnetfilter_conntrack-1.0.0 and am using the
FTFW mode for synchronization across a dedicated gigabit interface. The
active firewall has to contend with fairly heavy traffic, much of which
is in the form of long-lived TCP connections to an internal (LVS) load
balancer, behind which a bunch of application servers sit.
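For reference, an FTFW Sync section generally looks something like the
following (the address, group and interface names here are illustrative
placeholders rather than my exact values; my actual conntrackd.conf is
linked further down):

Sync {
    Mode FTFW {
    }
    Multicast {
        IPv4_address 225.0.0.50
        Group 3780
        IPv4_interface 192.168.100.1
        Interface eth2
        SndSocketBuffer 1249280
        RcvSocketBuffer 1249280
        Checksum on
    }
}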
The number of active, concurrent connections to this service peaks at
around 480,000. At last count, the number of conntrack states was
785,785, which is typical. I have net.nf_conntrack_max set to 1048576 and
the nf_conntrack module is loaded with hashsize=262144. The firewall is
fully stateful in that new connections must match --ctstate NEW. I'm
also using "-t raw -A PREROUTING -j CT --ctevents assured" as mentioned
in the docs.
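Concretely, the relevant bits amount to something like the following
(the iptables rules and the LVS address are simplified placeholders; the
real sysctl.conf and kernel config are linked further down):

# module parameter, e.g. in /etc/modprobe.d/nf_conntrack.conf
options nf_conntrack hashsize=262144

# sysctl.conf
net.nf_conntrack_max = 1048576

# iptables (simplified)
iptables -t raw -A PREROUTING -j CT --ctevents assured
iptables -A FORWARD -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
iptables -A FORWARD -m conntrack --ctstate NEW -d <lvs-vip> -j ACCEPT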
This is my current test case for the backup:-
1) Boot the system and start conntrackd
2) Run conntrackd -n to sync with the active firewall
3) Run conntrackd -c to commit the states from the external cache
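In shell terms, that's roughly the following (the last two commands are
only there to inspect the result and aren't part of the failure path):

conntrackd -d     # start the daemon on the backup
conntrackd -n     # request a full resync from the active node
conntrackd -c     # commit the external cache into the kernel table
conntrackd -s     # cache statistics
conntrack -C      # number of entries now in the kernel table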
Originally, while conntrackd -c was performing its work, I would
experience protracted soft lockups. After some investigation, I noticed
that conntrackd was trying to commit more states than
net.nf_conntrack_max, which, in turn, led me to this patch:-
https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=af14cca
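The mismatch is easy to observe by watching the kernel counters while
the commit is in progress, e.g.:

# run repeatedly (e.g. under watch) while conntrackd -c is running
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max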
Although Jozsef's patch was helpful, I'm still experiencing a nasty
kernel oops after conntrackd -c has finished executing. This always
occurs within 15 seconds or so - sometimes immediately. Here's a recent
netconsole trace from 3.3-rc5 + patch:-
http://paste.pocoo.org/raw/559736/
Though I ultimately intend to use the 3.0 kernel, I tried various other
versions going as far back as 2.6.32. In each case, an oops is
reproducible - though the details do vary. Using 3.3-rc5, I even noticed
a null ptr deref on one occasion. Alas, I was unable to capture it at
the time.
Here's some other configuration information which may be useful ...
conntrackd.conf: http://paste.pocoo.org/raw/559727/
sysctl.conf: http://paste.pocoo.org/raw/559726/
kernel .config: http://paste.pocoo.org/raw/559725/
It's perhaps worth noting that I followed the advice to set HashLimit in
conntrackd.conf to at least double net.nf_conntrack_max (it's commented
out in my config because I was experimenting with the issue that
Jozsef's patch rectifies). One thing that puzzles me is why conntrackd
always tries to commit more state entries than can be accommodated. On
the master, the internal cache grows to the maximum size and, afaict,
nothing is ever expired. This is from the master which has been up for a
while ...
# conntrackd -s | head -n 5
cache internal:
current active connections: 2097152
connections created: 31649757 failed: 234788761
connections updated: 105516073 failed: 0
connections destroyed: 29552605 failed: 0
# conntrack -S | head -n1
entries 792495
It seems that the cache usage grows to the maximum, at which point the
creation failed counter starts going skyward. On the backup, it seems
that conntrackd -n && conntrackd -c tries to commit all of this, but I
don't really understand why.
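For completeness, the HashLimit recommendation mentioned above amounts
to something like this in the General section (values illustrative; as I
said, the line is currently commented out in my real config):

General {
    HashSize 262144
    HashLimit 2097152   # at least 2 x net.nf_conntrack_max
}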
Any advice would be most welcome. I can't tinker too much with the
active firewall at this point but, if it helps, I can conduct any number
of tests with the backup.
Cheers,
--Kerin