Hi,
While waiting for a response to the email I sent below, I went ahead and
investigated the 3rd 'question' I raised in that email.
I essentially removed the ip_nat_used_tuple check for DNAT entries (which
applies to REDIRECT entries too).
I changed this code in get_unique_tuple (ip_nat_core.c) from this:
	if ((!(rptr->flags & IP_NAT_RANGE_PROTO_SPECIFIED)
	     || proto->in_range(tuple, HOOK2MANIP(hooknum),
	                        &rptr->min, &rptr->max))
	    && !ip_nat_used_tuple(tuple, conntrack)) {
		ret = 1;
		goto clear_fulls;
	} else {
to this:
	if ((!(rptr->flags & IP_NAT_RANGE_PROTO_SPECIFIED)
	     || proto->in_range(tuple, HOOK2MANIP(hooknum),
	                        &rptr->min, &rptr->max))
	    && (HOOK2MANIP(hooknum) == IP_NAT_MANIP_DST
	        ? 1 : !ip_nat_used_tuple(tuple, conntrack))) {
		ret = 1;
		goto clear_fulls;
	} else {
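For what it's worth, the effect of that change can be modelled outside the
kernel. This is a minimal user-space sketch, not the real netfilter code:
the enum and the boolean arguments are stand-ins for HOOK2MANIP() and for
the in_range/ip_nat_used_tuple results.

```c
#include <stdbool.h>

/* Stand-ins for the kernel's manip types -- illustrative only. */
enum manip { MANIP_SRC, MANIP_DST };

/* Models the modified test in get_unique_tuple(): when the manip is
 * DST (DNAT/REDIRECT), the "tuple already in use" check is skipped
 * entirely; for SRC manips it still has to pass. */
static bool tuple_acceptable(enum manip m, bool in_range, bool tuple_used)
{
    return in_range && (m == MANIP_DST ? true : !tuple_used);
}
```

The point the model makes: a DST manip now succeeds even when the exact
tuple is already taken, which is precisely the behaviour being tested here.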
I commented out the ASSERTs just after the proto->unique_tuple calls in
get_unique_tuple (ip_nat_core.c) as well, the lines that look like this:
	IP_NF_ASSERT(!ip_nat_used_tuple(tuple, conntrack));
And changed this code in tcp_unique_tuple (ip_nat_proto_tcp.c) from
this:
	for (i = 0; i < range_size; i++, port++) {
		*portptr = htons(min + port % range_size);
		if (!ip_nat_used_tuple(tuple, conntrack)) {
			return 1;
		}
	}
to this:
	if (maniptype == IP_NAT_MANIP_DST) {
		*portptr = htons(min + net_random() % range_size);
		return 1;
	} else {
		start = net_random() % range_size;
		port += start;
		for (i = start; i < range_size; i++, port++) {
			*portptr = htons(min + port % range_size);
			if (!ip_nat_used_tuple(tuple, conntrack)) {
				return 1;
			}
		}
		if (i == range_size) {
			port -= range_size;
			for (i = 0; i < start; i++, port++) {
				*portptr = htons(min + port % range_size);
				if (!ip_nat_used_tuple(tuple, conntrack)) {
					return 1;
				}
			}
		}
	}
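As a sanity check on that search logic, here is a user-space model of it
(a sketch only: the used[] array stands in for ip_nat_used_tuple(), and
rand() stands in for net_random()):

```c
#include <stdbool.h>
#include <stdlib.h>

/* User-space model of the modified tcp_unique_tuple() scan: start at a
 * random offset into the range and wrap around, instead of always
 * scanning from the bottom.  used[i] stands in for "port min+i fails
 * the ip_nat_used_tuple() check". */
static int pick_port(unsigned min, unsigned range_size, const bool *used,
                     unsigned *out)
{
    unsigned start = (unsigned)rand() % range_size;

    for (unsigned i = 0; i < range_size; i++) {
        unsigned idx = (start + i) % range_size;
        if (!used[idx]) {
            *out = min + idx;   /* first free port, in wrap order */
            return 1;
        }
    }
    return 0;                   /* every port in the range is taken */
}
```

Whatever the random start, the loop still tests all range_size ports, so a
free port is always found if one exists.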
I have only one rule in the entire NAT table: the PREROUTING rule that
redirects all new connections destined for machines behind the router to a
specific port range on the local machine.
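The exact rule was never quoted in this thread, but it would be something of
roughly this shape (the port range is illustrative):

```shell
# Illustrative only -- the real rule and ports were not quoted here.
# Redirect new inbound TCP connections for hosts behind the router to a
# local listening range, from the nat table's PREROUTING chain:
iptables -t nat -A PREROUTING -p tcp -j REDIRECT --to-ports 5000-5100
```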
This change DOES seem to have the desired effect of making connections fully
establish almost immediately, and, as suspected, since the socket on the
local machine is just a listening socket, it really does not care about
multiple connections and thus does not need the 'in use' check above.
However, after putting this in place, the system seems to stop functioning
(I'm not sure if it's just the network or the system itself, since I'm not at
the console; I suspect it's the system itself, as it's a very sudden freeze).
Could someone shed some light on why the system would freeze after a short
period of time (less than 5 minutes) with this code running? (Note: it only
freezes when our application is running, i.e. there is something there to
accept connections.) Could someone also shed some light on possible
side effects the above modifications could have (apart from freezing the
system)? I don't usually mess around with the kernel (though I have before),
so this is relatively new territory for me.
Any and all help, comments, etc. appreciated.
Thanks,
PreZ :)
--- Begin Message ---
On Friday 30 May 2003 03:42 pm, Harald Welte wrote:
> > If I then telnet to 10.0.0.1 on port 5050, the connection is immediate,
> > and my application receives a new connection on port 5050. If, however,
> > I telnet to 10.0.0.1 on port 5150, there is a small (but noticable) delay
> > between when the telnet session shows the connection as established, and
> > when the application receives the connection (on a random port between
> > 5000 and 5100 inclusive).
>
> This is _definitely_ not a netfilter issue then. We are not doing
> transparent proxying but nat. netfilter/iptables _never_ accept
> connections on their own. So you open a connection:
Maybe not, but that's the behavior I'm seeing. And it doesn't bother me so
much that the connection is accepted as that it takes a LOT longer for a
connection to be established when connecting to a port outside the range
than to one inside the range.
> 1. telnet 10.0.0.1 5150
> 2. syn packet is sent by telnet
> 3. syn packet is DNAT'ed by netfilter
> 4. syn packet arrives at server application
> 5. syn/ack packet is sent by server application
> 6. syn/ack packet is SNAT'ed by netfilter
> 7. syn/ack packet is received by telnet
> [further handshake goes on]
> x. telnet application prints 'Connection established' (connect(2) call
> returns)
>
> This is a fundamental tcp/ip operation, and it can certainly by no way
> be anything that netfilter does, that would introduce a behaviour like
>
> 1. telnet 10.0.01 5150
> 2. telnet shows 'connection established'
> 3. connection 'arrives' at server
>
> [which is what you have been describing, If I understood you correctly].
I realise this is what is supposed to happen; however, again, it's not what
I'm observing, though it's only noticeable under high load. So maybe it's
taking a long time to get the initial SYN and ACK to the server from telnet,
which would account for both the delay before the connection is established
by telnet (the delay for the SYN/ACK to get back from the server) and the
delay between this and the server showing the connection established (which
only happens after it gets the ACK).
OK, so with this information: it's taking a LONG time to get the SYN and ACK
to the server when I'm connecting to a port outside the range I'm listening
on with the application, which makes telnet show the port as connected long
before the server shows it as established. In any case, this doesn't change
the problem. This delay does *NOT* exist when connecting to a port that is
in the listening range (remember that all connections being discussed here
are going to systems BEHIND the router box).
I did some kernel diving and isolated the place where the behavior diverges
between a connection 'in' and 'out' of the port range: this if statement in
get_unique_tuple (ip_nat_core.c):
	if ((!(rptr->flags & IP_NAT_RANGE_PROTO_SPECIFIED)
	     || proto->in_range(tuple, HOOK2MANIP(hooknum),
	                        &rptr->min, &rptr->max))
	    && !ip_nat_used_tuple(tuple, conntrack)) {
Obviously, anything connecting to a port outside this range will fail the
above test. When it fails, we end up (one way or another) at this:
	if (proto->unique_tuple(tuple, rptr, HOOK2MANIP(hooknum),
	                        conntrack)) {
The unique_tuple function is where the delay is. In this case it's a TCP
connection, which translates to tcp_unique_tuple (ip_nat_proto_tcp.c),
which, after determining the port range, does the following:
	for (i = 0; i < range_size; i++, port++) {
		*portptr = htons(min + port % range_size);
		if (!ip_nat_used_tuple(tuple, conntrack)) {
			return 1;
		}
	}
Now I'm assuming it will use this:
	min = ntohs(range->min.tcp.port);
	range_size = ntohs(range->max.tcp.port) - min + 1;
rather than this:
	min = 1024;
	range_size = 65535 - 1024 + 1;
for my port range, since I specified a local port range; either way, though,
the range is about 100 ports. This means the above 'NAT used' lookup happens
about 100 times (as opposed to once when the port is inside the range). And
since ip_nat_used_tuple (ip_nat_core.c) calls ip_conntrack_tuple_taken
(ip_conntrack_core.c), which subsequently locks/unlocks the entire conntrack
table and calls __ip_conntrack_find (which is a linear list lookup!), the
above for loop is a very expensive process.
Now, after all that, a few questions arise:
1) Could the above for loop lock the conntrack table beforehand, do all the
searches under that one lock, and unlock afterwards? The constant
locking/unlocking is probably the biggest thing consuming all the speed,
especially when there are many new socket connections coming in every
second.
2) Could the above for loop use a random 'start' position, search through to
the end, and then continue from the beginning until it ends up back where it
started? This would avoid it constantly treading the same ground (if the
last entry used port 1, it's pretty much assured port 1 is still in use when
you allocate the next entry, so you could either skip directly to 2, or
randomly choose a new starting position). I.e. something like:
	int port, start_port = (rand() % range) + min_port;
	for (port = start_port; port < min_port + range; port++)
		if (!used) return 1;
	if (port == min_port + range)
		for (port = min_port; port < start_port; port++)
			if (!used) return 1;
	return 0;
Obviously, the kernel's random number generator would need to be used
instead of rand(), etc.
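One property worth convincing yourself of before trying that: the two loops
together must visit every port in the range exactly once, whatever the
random start. A quick user-space check of that invariant (demo-sized,
nothing kernel-specific about it):

```c
#include <stdbool.h>

/* Verifies the wraparound invariant behind the two-loop search: from an
 * arbitrary start offset, the "start..end" leg plus the "begin..start"
 * leg cover every port in [min, min+range) exactly once. */
static bool covers_range_once(unsigned min, unsigned range, unsigned start)
{
    unsigned seen[64] = { 0 };          /* demo sizes only: range <= 64 */
    unsigned port;

    if (range == 0 || range > 64)
        return false;
    port = min + start % range;

    /* first leg: start position to the end of the range */
    for (unsigned p = port; p < min + range; p++)
        seen[p - min]++;
    /* second leg: beginning of the range back up to the start */
    for (unsigned p = min; p < port; p++)
        seen[p - min]++;

    for (unsigned i = 0; i < range; i++)
        if (seen[i] != 1)
            return false;
    return true;
}
```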
3) Why does it need to do the uniqueness check in the first place (for DNAT
and REDIRECT)? This code should only be triggered on a new connection, and
if I have two new connections from the same source to the same destination
and the same (NAT'd) destination port, who cares? The destination port has
to be listening for it anyway, and there's nothing stopping it accepting two
connections from the same source to the same local port (the source port
will always be different). I'm assuming the 'resend' checks are done before
this point anyway (using the ORIGINAL destination IP/port, not the NAT'd
one) in the system's TCP stack.
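The 'source port will always be different' point is the crux: the full
connection tuple stays unique even when two connections from one client hit
the same NAT'd port. A simplified illustration (this struct is a stand-in,
not the kernel's real struct ip_conntrack_tuple):

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified connection 4-tuple -- illustrative, not the kernel's. */
struct tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* Two tuples describe the same connection only if ALL fields match;
 * a differing source port alone keeps them distinct. */
static bool tuple_equal(const struct tuple *a, const struct tuple *b)
{
    return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
           a->src_port == b->src_port && a->dst_port == b->dst_port;
}
```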
Anyway, if I'm way off base here, let me know, but I'm just trying to figure
out why there is a difference between connecting to a port inside and a port
outside the range I'm listening to on the router. It's not a small
difference (time-wise) either.
And re: your other email:
1) I know SNATs is not a TCP state; the data was copied from a file I
generate every 5 minutes gathering this data, and that was the line above
the one I meant to copy.
2) When I said 'without state', I meant UDP connections, since they don't
have TCP state information.
- --
PreZ
Systems Administrator
Shadow Realm
PGP FingerPrint: B3 0C F3 32 DE 5A 7D 90 26 F6 FA 38 CC 0A 2D D8
Finger prez@xxxxxxxxxxxxx for full PGP public key.
Shadow Realm, a hobbyist ISP supplying real internet services.
http://www.srealm.net.au
--- End Message ---