[Fwd: Re: [netfilter-core] iptables/conntrack in enterprise environment.]

Hi,

While waiting for a response to the email I sent below, I went ahead and
investigated the 3rd 'question' I raised in that email.

I essentially removed the ip_nat_used_tuple check for DNAT entries (which
applies to REDIRECT entries too).

I changed this code in get_unique_tuple (ip_nat_core.c) from this:
                if ((!(rptr->flags & IP_NAT_RANGE_PROTO_SPECIFIED)
                     || proto->in_range(tuple, HOOK2MANIP(hooknum),
                                        &rptr->min, &rptr->max))
                    && !ip_nat_used_tuple(tuple, conntrack)) {
                        ret = 1;
                        goto clear_fulls;
                } else {

to this:
                if ((!(rptr->flags & IP_NAT_RANGE_PROTO_SPECIFIED)
                     || proto->in_range(tuple, HOOK2MANIP(hooknum),
                                        &rptr->min, &rptr->max))
                    && (HOOK2MANIP(hooknum) == IP_NAT_MANIP_DST
                        ? 1 : !ip_nat_used_tuple(tuple, conntrack))) {
                        ret = 1;
                        goto clear_fulls;
                } else {

I commented out the ASSERTs just after the proto->unique_tuple calls in
get_unique_tuple (ip_nat_core.c) as well, the lines that look like this:
                                        IP_NF_ASSERT(!ip_nat_used_tuple(tuple, conntrack));


And changed this code in tcp_unique_tuple (ip_nat_proto_tcp.c) from
this:
        for (i = 0; i < range_size; i++, port++) {
                *portptr = htons(min + port % range_size);
                if (!ip_nat_used_tuple(tuple, conntrack)) {
                        return 1;
                }
        }

to this:
        if (maniptype == IP_NAT_MANIP_DST)
        {
                *portptr = htons(min + net_random() % range_size);
                return 1;
        }
        else
        {
                start = net_random() % range_size;
                port += start;

                for (i = start; i < range_size; i++, port++) {
                        *portptr = htons(min + port % range_size);
                        if (!ip_nat_used_tuple(tuple, conntrack)) {
                                return 1;
                        }
                }
                if (i == range_size)
                {
                        port -= range_size;
                        for (i = 0; i < start; i++, port++) {
                                *portptr = htons(min + port % range_size);
                                if (!ip_nat_used_tuple(tuple, conntrack)) {
                                        return 1;
                                }
                        }
                }
        }

I only have one rule in the entire NAT table: the one in the PREROUTING
'chain' that redirects all new connections destined for the machines behind
this box to a specific port range on the local machine.
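
For reference, the rule looks something like this (the interface and match
options here are illustrative rather than copied from my config; the REDIRECT
target and the 5000-5100 port range are the relevant parts):

        iptables -t nat -A PREROUTING -i eth0 -p tcp -j REDIRECT --to-ports 5000-5100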

This change DOES seem to have the desired effect of making connections fully
establish almost immediately, and, as suspected, since the socket on the local
machine is just a listening socket, it really does not care about multiple
connections mapping to it and thus does not need the 'in use' check above.
However, after putting this in place, the system seems to stop functioning
(I'm not sure if it's just the network or the system itself, since I'm not at
the console, but I suspect it's the system itself, as it is a very sudden
freeze).

Could someone shed some light on why the system would freeze after a short
period of time (less than 5 minutes) with this code running (note: it only
freezes when our application is running, i.e. there is something there to
accept connections)?  And could someone also point out possible side effects
the above modifications could have (apart from freezing the system)?  I don't
usually screw around with the kernel (though I have before), so this is
relatively new territory for me.

Any and all help, comments, etc. appreciated.

Thanks,

PreZ :)
--- Begin Message ---

On Friday 30 May 2003 03:42 pm, Harald Welte wrote:
> > If I then telnet to 10.0.0.1 on port 5050, the connection is immediate,
> > and my application receives a new connection on port 5050.  If, however,
> > I telnet to 10.0.0.1 on port 5150, there is a small (but noticable) delay
> > between when the telnet session shows the connection as established, and
> > when the application receives the connection (on a random port between
> > 5000 and 5100 inclusive).
>
> This is _definitely_ not a netfilter issue then.  We are not doing
> transparent proxying but nat.  netfilter/iptables _never_ accept
> connections on their own.  So you open a connection:
Maybe not, but that's the behavior I'm seeing.  But this doesn't bother me as
much as the fact that it takes a LOT longer for a connection to be established
when connecting to a port outside the range than to one inside the range.


> 1. telnet 10.0.0.1 5150
> 2. syn packet is sent by telnet
> 3. syn packet is DNAT'ed by netfilter
> 4. syn packet arrives at server application
> 5. syn/ack packet is sent by server application
> 6. syn/ack packet is SNAT'ed by netfilter
> 7. syn/ack packet is received by telnet
>  [further handshake goes on]
> x. telnet application prints 'Connection established' (connect(2) call
>    returns)
>
> This is a fundamental tcp/ip operation, and it can certainly by no way
> be anything that netfilter does, that would introduce a behaviour like
>
> 1. telnet 10.0.01 5150
> 2. telnet shows 'connection established'
> 3. connection 'arrives' at server
>
> [which is what you have been describing, If I understood you correctly].
I realise this is supposed to happen; however, again, it's not what I'm
observing, though it's only noticeable under high load.  So maybe it's taking a
long time to get the initial SYN and ACK to the server from telnet, which
would account for both the delay before the connection is established by telnet
(the delay for the SYN/ACK to get back from the server) and the delay between
this and the server showing the connection established (which only happens
after it gets the ACK).

OK, so with this information: it's taking a LONG time to get the SYN and ACK to
the server when I'm connecting to a port outside the range I'm listening on
with the application, which makes telnet show the connection as established
long before the server sees it.  In any case, this doesn't change the problem.
This delay does *NOT* exist when connecting to a port that is in the listening
range (remember that all connections being discussed here are going to systems
BEHIND the router box).

I did some kernel diving and have isolated the place where the behavior
differs between a connection 'in' and 'out' of the port range: this if
statement in get_unique_tuple (ip_nat_core.c):
               if ((!(rptr->flags & IP_NAT_RANGE_PROTO_SPECIFIED)
                     || proto->in_range(tuple, HOOK2MANIP(hooknum),
                                        &rptr->min, &rptr->max))
                    && !ip_nat_used_tuple(tuple, conntrack)) {

Obviously, anything connecting to a port outside this range will fail the 
above if.  If this fails, it will lead us to do this (one way or another):
                       if (proto->unique_tuple(tuple, rptr,
                                                HOOK2MANIP(hooknum),
                                                conntrack)) {

The unique_tuple function is where the delay is.  In this case, it's a TCP
connection, which translates to tcp_unique_tuple (ip_nat_proto_tcp.c), which,
after determining the port range, does the following:
        for (i = 0; i < range_size; i++, port++) {
                *portptr = htons(min + port % range_size);
                if (!ip_nat_used_tuple(tuple, conntrack)) {
                        return 1;
                }
        }

Now I'm assuming it will use this:
                min = ntohs(range->min.tcp.port);
                range_size = ntohs(range->max.tcp.port) - min + 1;
rather than this:
                        min = 1024;
                        range_size = 65535 - 1024 + 1;
for my port range, since I specified a local port range; but either way, the
range is about 100 ports.  That means the above 'NAT used' lookup happens 100
times (as opposed to once when the port is inside the range).  And since
ip_nat_used_tuple (ip_nat_core.c) calls ip_conntrack_tuple_taken
(ip_conntrack_core.c), which locks/unlocks the entire conntrack table and
calls __ip_conntrack_find (which is a list lookup!), the above for loop is a
very intensive process.
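
To make the per-iteration cost concrete, ip_conntrack_tuple_taken is (going
from memory, roughly paraphrased rather than quoted) little more than a locked
wrapper around the list lookup:

int ip_conntrack_tuple_taken(const struct ip_conntrack_tuple *tuple,
                             const struct ip_conntrack *ignored_conntrack)
{
        struct ip_conntrack_tuple_hash *h;

        /* The read lock is taken and dropped on every single call. */
        READ_LOCK(&ip_conntrack_lock);
        h = __ip_conntrack_find(tuple, ignored_conntrack);
        READ_UNLOCK(&ip_conntrack_lock);

        return h != NULL;
}

So every pass around the for loop above pays for a lock/unlock pair plus a
walk of the conntrack list.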

Now, after all that, a few questions arise:
1) Is there a way the above for loop could lock the conntrack table
beforehand, do the searches without re-locking each time, and unlock
afterwards?  The constant locking/unlocking is probably the biggest thing
consuming the time, especially when there are many new socket connections
coming in every second.  A rough sketch of what I mean is below.
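
(Sketch only: ip_nat_used_tuple_nolock is a hypothetical variant of
ip_nat_used_tuple that calls __ip_conntrack_find directly instead of going
through the locking in ip_conntrack_tuple_taken.)

        READ_LOCK(&ip_conntrack_lock);
        for (i = 0; i < range_size; i++, port++) {
                *portptr = htons(min + port % range_size);
                /* hypothetical unlocked check; the lock is already held */
                if (!ip_nat_used_tuple_nolock(tuple, conntrack)) {
                        READ_UNLOCK(&ip_conntrack_lock);
                        return 1;
                }
        }
        READ_UNLOCK(&ip_conntrack_lock);
        return 0;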

2) Is there a way the above for loop could use a random 'start' place, search
through to the end, and then start at the beginning, until it ends up back at
the same place it started from?  That would avoid it constantly treading the
same ground (if the last entry used 1, it's pretty much assured 1 is still
going to be in use when you get the next entry, so you could either skip
directly to 2, or randomly choose a new position to start at).  i.e. something
like:
	int port, start_port = (rand() % range) + min_port;
	for (port = start_port; port < min_port + range; port++)
		{ if (!used) { return 1; } }
	if (port == min_port + range)
		for (port = min_port; port < start_port; port++)
			{ if (!used) { return 1; } }
	return 0;
Obviously, the kernel's rand would need to be used, etc.  A stand-alone
illustration of the idea is below.
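
For illustration only, here is a small self-contained user-space version of
that wrap-around scan (the used() function is just a stand-in for the
ip_nat_used_tuple check, and the 5000-5100 range matches my setup):

        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        /* Stand-in for ip_nat_used_tuple(): pretend a few ports are taken. */
        static int used(int port)
        {
                return port < 5005;
        }

        /* Scan [min_port, min_port + range) from a random offset, wrapping
         * around, so successive calls don't keep retreading the low end of
         * the range.  Returns a free port, or -1 if every port is in use. */
        static int pick_port(int min_port, int range)
        {
                int start = rand() % range;
                int i;

                for (i = 0; i < range; i++) {
                        int port = min_port + (start + i) % range;
                        if (!used(port))
                                return port;
                }
                return -1;
        }

        int main(void)
        {
                srand((unsigned)time(NULL));
                printf("picked port %d\n", pick_port(5000, 101));
                return 0;
        }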

3) Why does it need to do the unique check in the first place (for DNAT and
REDIRECT)?  This code should only be triggered on a new connection, and if I
have 2 new connections from the same source to the same destination and the
same destination (NAT'd) port, who cares?  The destination port has to be
listening for it anyway, and there's nothing stopping it accepting 2
connections from the same source to the same local port (the source port will
always be different).  I'm assuming the 'resend' checks are done before this
point anyway (using the ORIGINAL destination ip/port, not the NAT'd one) in
the system's TCP stack.

Anyway, if I'm way off base here, let me know, but I'm just trying to figure
out why there is a difference between connecting to a port inside, and a port
outside, the range I'm listening on at the router.  It's not a small difference
(time-wise) either.

And re: your other email:
1) I know SNATs is not a TCP state; the data was copied from a file I generate
every 5 minutes gathering this data, and that was the line above what I meant
to copy.
2) When I said 'without state', I meant UDP connections, since they don't have
TCP state information.

-- 
PreZ
Systems Administrator
Shadow Realm

PGP FingerPrint: B3 0C F3 32 DE 5A 7D 90  26 F6 FA 38 CC 0A 2D D8
Finger prez@xxxxxxxxxxxxx for full PGP public key.

Shadow Realm, a hobbyist ISP supplying real internet services.
http://www.srealm.net.au

--- End Message ---
