SV: Conntrack insertion race conditions -- any workarounds?

> By contrast, what's happening in the case that spawned this thread is that the client uses the same source port for two separate packets,
> but the source port is still random and unlikely to conflict with some other client, and there is no issue for the NAT to translate them back to the original client because the translation for both packets is the same. 
> The bug is that the firewall processes them incorrectly when two packets with the same new source and destination address and port are processed concurrently -- an implementation flaw, not a design flaw.
> The packet marking and so forth is only an attempted workaround until the patch is in place.

This is partially a circular argument; by the same logic, why isn't the next request using the same port?
Or why not use the same port for X seconds, then change ports to save resources?
The answer is still that security matters, and at some point you would need to drop "tailgating" packets.

You can always argue that this should also be fixed in the firewall, to avoid the scenario of dropping such packets (and I can partially agree).
But from a security perspective you will need some clear boundaries for deciding which incoming packets to drop, and when.
And DNS is a rather important and indispensable part of today's Internet, so making some effort to make it easier to protect from multiple angles is a reasonable argument.


> For example, an internal DNS cache makes outgoing queries when (for about 20% of requests) it doesn't have the record cached, and it will regularly reuse source ports. 
> By the birthday problem, if it randomly chooses a source port for each request, by around 300 queries there is a 50% chance that two of them will use the same source port.
> Not reusing ports would increase exposure to the Kaminsky attack (and risk port exhaustion).

I am pretty sure this statement is not entirely correct. First of all, picking a random number from 1025-65535 as your source port does not give you a 50% chance of hitting any of some 300 ports currently in use.
As of 2018 many (if not most) firewalls are not set up by default to clear a UDP state until the state TTL runs out, regardless of service and of whether the "last" response has arrived (which again is a security issue in itself).
Secondly, you seem to forget/ignore that when we talk about reuse of the SOURCE port, the issue only arises in combination with contacting the SAME SERVICE on a single remote host ...
... which makes all 4 IP header fields used in state tables equal (i.e. SOURCE IP, SOURCE PORT, DESTINATION IP, DESTINATION PORT) before the response arrives (and the entry is removed from the state table).
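
For reference, here is a quick Python sketch of the two different probabilities being debated here (my own illustration, assuming a 1025-65535 ephemeral range and uniform random port selection):

    import math

    PORTS = 65535 - 1025 + 1  # 64511 usable source ports

    def p_any_pair_collides(n):
        """Birthday problem: chance that at least two of n random ports match."""
        log_p_unique = sum(math.log1p(-k / PORTS) for k in range(n))
        return 1.0 - math.exp(log_p_unique)

    def p_hit_fixed_set(k):
        """Chance that ONE new random port lands in a fixed set of k in-use ports."""
        return k / PORTS

    print(f"any pair among 300 random ports:  {p_any_pair_collides(300):.1%}")  # ~50%
    print(f"one new port vs 300 in-use ports: {p_hit_fixed_set(300):.2%}")      # ~0.47%

The ~50% figure only says *some* pair of source ports matches; hitting a specific in-use port (let alone the full 4-tuple) is the much smaller second number.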

Unless you use a SINGLE forwarder for your internal DNS cache (which would in most cases make you a standalone client with a "lower" number of requests),
you would need about 10000 lookups per second, with a normal 2 second timeout per query, for a single query to have a *theoretical* 50% chance of hitting one of the old SOURCE PORTS.
Statistically, out of those 10000 requests, the average time to get a response is around 10-50ms, meaning you would theoretically have around 300 requests active in any 33ms window and expect 300 new requests in the same time-frame,
of which you could *theorize* that in total there is somewhere around a 50% chance that one of the 300 new requests could hit one of the 300 old requests within that 33ms time-frame.
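
To put rough numbers on that window (my own back-of-the-envelope sketch; the rate and latency figures are the ones from the paragraph above):

    qps = 10000        # hypothetical resolver load, lookups per second
    latency_s = 0.033  # ~33ms average response time from the text
    in_flight = qps * latency_s
    print(f"~{in_flight:.0f} requests in flight in any {latency_s * 1000:.0f}ms window")
    # ~330 in flight, with roughly as many new ones starting in the same window,
    # which is where the "300 old vs 300 new" collision estimate comes from.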

However, consider some anecdotal evidence: a single host in a "fairly" busy WEB PROXY CLUSTER sees 1000 DNS requests per second with a 90% DNS CACHE hit rate,
leaving 100 outgoing requests per second on average -> so you seem likely to have other concerns if the client DNS on whatever system you set up reaches 10000 DNS requests per second.
So, as I said earlier about state TTL: if the table leaves DNS states "alive" for, say, a UDP state TTL of 40 seconds, that would leave 4000 states on average per server in the firewall in the above scenario.
With 10000 DNS requests per second practically every SOURCE port would be open all the time, which in my opinion is another good/solid argument, security-wise,
for removing from the state table any/all state entries where the matching response has been seen passing the firewall for DNS, and allowing only ONE request and ONE response per SOURCE port.
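
A quick occupancy sketch of that state-table argument (again my own arithmetic, assuming UDP states live for the full TTL regardless of whether the response was seen):

    PORTS = 64511  # 1025-65535 ephemeral range

    def live_states(qps, udp_state_ttl_s):
        return qps * udp_state_ttl_s

    for qps in (100, 1000, 10000):
        states = live_states(qps, 40)  # 40 second UDP state TTL from the text
        share = min(states / PORTS, 1.0)
        print(f"{qps:5d} q/s * 40s TTL -> {states:6d} live states ({share:.0%} of the port space)")
    # 100 q/s  ->   4000 states (~6%), matching the 4000 figure above;
    # 10000 q/s -> nominally 400000, i.e. every source port busy all the time.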


> Using a separate source port for a second connection from the same client to the same remote IP and port is also necessary for TCP because otherwise the streams would be intermixed,
> but that isn't an issue for UDP because unlike TCP streams, datagrams preserve message boundaries.

Again, I believe this is only really true for the endpoints (as components in the middle would normally not be aware of any boundaries),
and using UDP while assuming the boundaries are respected across every unit is what has caused some of the conflicts between security and practical use of UDP.
VPN is a good example: you will find many IPSEC adaptations to allow for NAT traversal, driven by the issues caused by similar assumptions and uses of UDP.

Again, using UDP in this way is not strictly "wrong", but it being 2018, I tend to say it is "bad" design to reuse the SOURCE port when it has such a low cost to avoid.
Saving a few CPU cycles on mobile devices to preserve their battery might be valid for some users,
but you would potentially save more by increasing the TTL on DNS records, which is set at 5-120 seconds for the most common services used by today's mobile users.
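
For a sense of what it costs to avoid the reuse, here is a minimal sketch of the "fresh socket per query" pattern (illustrative only; no real DNS payload is built here):

    import socket

    def query_with_fresh_port(server_addr, payload, timeout=2.0):
        # A new unbound UDP socket gets a fresh OS-assigned source port on
        # the first sendto(), so consecutive queries never share a 4-tuple
        # in the firewall's state table.
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.settimeout(timeout)
            s.sendto(payload, server_addr)
            data, _addr = s.recvfrom(4096)
            return data

The per-query cost is one socket creation and teardown, which is cheap next to the network round-trip.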

On a completely different note, I have no idea why after 40+ years there are still mainly 6 protocols commonly used (ignoring internal stuff like IS-IS/IGP/EGP) on the IP Internet (and still primarily IPv4):
TCP
UDP
ICMP
GRE
ESP
IGMP

It seems like many prefer using a known protocol like UDP for streaming rather than using SCTP or creating other protocols more suited to various needs.
Instead, developers seem to try to push everything TCP cannot be used for into UDP, seemingly for any reason and at any cost, without much thought about today's security reality.


Best regards
André Paulsberg-Csibi
Senior Network Engineer 
IBM Services AS



-----Original Message-----
From: zrm <zrm@xxxxxxxxxxxxxxx> 
Sent: Friday, 7 September 2018 07:09
To: André Paulsberg-Csibi (IBM Consultant) <Andre.Paulsberg-Csibi@xxxxxxxx>
Cc: netfilter@xxxxxxxxxxxxxxx
Subject: Re: SV: SV: Conntrack insertion race conditions -- any workarounds?

On 09/06/2018 05:32 PM, André Paulsberg-Csibi (IBM Consultant) wrote:
>  From my understanding after development of FireWall and TCP/IP in general used in modern networking , it seems that it has been reasoned that clients should avoid to send 2 separate requests with same source port .
> ( again , not as an absolute rule , but certainly as a strong rule of 
> thumb )

This is generally true for TCP because each connection will have its own socket and unless a specific source port is requested (which there is often no reason to do), the operating system will arbitrarily assign one for each connection.

But the opposite is true for UDP -- it's common for a UDP program to bind one socket to one source port and use it for all its communications, distinguishing peers by the remote IP and port (which is provided when using the connectionless sendto(2)/recvfrom(2) but not the connection-oriented send(2)/recv(2)). If it weren't for the Kaminsky attack DNS clients would do this as well.

Using a separate source port for a second connection from the same client to the same remote IP and port is also necessary for TCP because otherwise the streams would be intermixed, but that isn't an issue for UDP because unlike TCP streams, datagrams preserve message boundaries.

> I am not sure it is correct to describe 2 separate request via UDP as a flow , but I agree that the client isn't directly doing something that is "wrong" .
> As you say historically like my DHCP RELAY example it was accepted ( and normal ) to only user port 67 as both SOURCE and DESTINATION port .
> However it doesn't take much reasoning to argue that this is potentially problematic for any sort of state tables , and it seems un-economic to make more advanced state tables or mark packets to avoid certain scenarios .

These are two separate problems. One is that every client uses the same source (and destination) port, which creates a problem for any one-to-many NAT device. For each remote host only one client can have that port pair on the external IP address. This can be solved by the NAT translating the second client to some other source port, but only if the server doesn't require the client to use that specific source port.

By contrast, what's happening in the case that spawned this thread is that the client uses the same source port for two separate packets, but the source port is still random and unlikely to conflict with some other client, and there is no issue for the NAT to translate them back to the original client because the translation for both packets is the same. 
The bug is that the firewall processes them incorrectly when two packets with the same new source and destination address and port are processed concurrently -- an implementation flaw, not a design flaw. The packet marking and so forth is only an attempted workaround until the patch is in place.

> ( compared to opening 2 sockets from the various clients which have a 
> distributed "load" for millions of clients , while the "servers" and 
> "firewalls" need to be optimized for millions of requests )

When you're dealing with libc the clients run the gamut. Some will be mobile devices where every cycle they spend consumes battery life. Many of the "clients" will themselves be servers. Web crawlers and mail servers spend a good fraction of their cycles making DNS queries.

> I disagree this is a bug in the FIREWALL(s) as this would ONLY happen 
> when reusing source port which in my opinion isn't reasonable 
> optimization for simple clients

There are other circumstances where this can happen or is required to. 
For example, an internal DNS cache makes outgoing queries when (for about 20% of requests) it doesn't have the record cached, and it will regularly reuse source ports. By the birthday problem, if it randomly chooses a source port for each request, by around 300 queries there is a 50% chance that two of them will use the same source port. Not reusing ports would increase exposure to the Kaminsky attack (and risk port exhaustion).

> , and mind you this is the security feature of the FIREWALL to track states and for DNS there are no flows like you have with SIP / SYSLOG .
> Which also reuse the SOURCE port for their flows , but these are known directly for establishing such flows - which is not the same for DNS which after also using random ports for each request now make 2 for each after IPv6 was added .

IPv6 has been with us for years though, and mail servers have done a 
similar thing even longer. The host for sending mail to a domain is 
specified in the MX record unless there isn't one, in which case the A 
record is used, and some mail software will do the MX and A queries 
simultaneously.

DNS doesn't have "flows" in the sense that it is mostly individual query 
transactions (though see also dynamic updates and zone transfers), but 
treating a set of UDP DNS query transactions using the same ports as a 
flow in the style of any generic indeterminate UDP-based protocol 
produces the desired results (and is more likely not to break new or 
uncommon protocol features), so where is the advantage in 
application-protocol-specific treatment? In theory you could discard 
state sooner, but if you're so close to the point of port exhaustion 
that this really matters, you may be better off acquiring more IP 
addresses instead. Reusing the port mappings more quickly like that 
would fall into the same issue that causes TCP to have a TIME_WAIT state 
-- without it a packet (or retransmit) for the old mapping could be sent 
to the new one.



