Re: [RFC nf-next 0/4] netfilter: conntrack: allow insertion of clashing entries

Kadlecsik József <kadlec@xxxxxxxxxxxxxxxxx> · Tue, 14 Jan 2020 22:14:58 +0100 (CET)

Hi Florian,

On Tue, 14 Jan 2020, Florian Westphal wrote:

> Florian Westphal <fw@xxxxxxxxx> wrote:
> > This entire series isn't nice but so far I did not find a better
> > solution.
> 
> I did consider getting rid of the unconfirmed list, but this is also
> problematic.
> 
> At allocation time we do not know what kind of NAT transformations
> will be applied by the ruleset, i.e. we'd need another locking step to
> move the entries to the right location in the hash table.
> 
> Same if the skb is dropped: we need to lock the conntrack table again to
> delete the newly added entry -- this isn't needed right now because the
> conntrack is only on the percpu unconfirmed list in this case.
> 
> This is also a problem because of conntrack events: we would have to
> separate insertion and notification, else we'd flood userspace for every
> conntrack we create in case of a packet drop flood.
> 
> Other solutions are:
> 1. use a ruleset that assigns the same nat mapping for both A and AAAA
>    requests, or,
> 2. steer all packets that might have this problem (i.e. udp dport 53) to
>     the same cpu core.
> 
> Yet another solution would be a variation of this patch set:
> 
> 1. Only add the reply direction to the table (i.e. conntrack -L won't show
>    the duplicated entry).
> 2. Add a new conntrack flag for the duplicate that guarantees the
>    conntrack is removed immediately when first reply packet comes in.
>    This would also have the effect that the conntrack can never be
>    assured, i.e. the "hidden duplicates" are always early-droppable if
>    conntrack table gets full.
> 3. Change event code to never report such duplicates to userspace.

Somehow my general feeling is that all of the proposed conntrack fixes
could, in some cases, break other UDP applications that do not follow the
single-request, single-response pattern.

Reading about the Kubernetes issue, as far as I can see:

a. When the pods run glibc-based systems, the issue could easily be
   fixed by configuring the real DNS server IP addresses in the pods'
   resolv.conf files with "options single-request single-request-reopen"
   enabled.
b. When the pods run musl-based systems, there is no such solution
   because the main musl developer refused to implement the required
   RES_SNGLKUP and RES_SNGLKUPREOP options in musl.

However, I think there is a general, already available solution in
iptables: force the same DNAT mapping for all packets of the same socket
with the HMARK target. Something like this:

-t raw -p udp --dport 53 -j HMARK --hmark-tuple src,sport \
	--hmark-mod 2 --hmark-offset 10 --hmark-rnd 0xdeafbeef
-t nat -p udp --dport 53 -m state --state NEW -m mark --mark 10 -j DNAT ..
-t nat -p udp --dport 53 -m state --state NEW -m mark --mark 11 -j DNAT ..
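
To illustrate why hashing on (src, sport) pins both queries of one socket
to the same DNAT rule, here is a minimal Python sketch. It is not the
kernel's HMARK code: the function name hmark() is hypothetical, and
zlib.crc32 is only a stand-in for the kernel's hash. The one property it
shares with the rules above is that the mark is computed as
"hash modulo mod, plus offset" from a fixed tuple and seed.

```python
import socket
import struct
import zlib


def hmark(src_ip: str, sport: int, mod: int, offset: int, rnd: int) -> int:
    """Toy model of HMARK: seeded hash of (src, sport), then % mod + offset.

    Assumption: crc32 replaces the kernel's real hash function; only the
    determinism and the mod/offset arithmetic are meant to be faithful.
    """
    key = socket.inet_aton(src_ip) + struct.pack("!H", sport)
    h = zlib.crc32(key, rnd & 0xFFFFFFFF)
    return h % mod + offset


# The A and AAAA queries leave from the same socket, so they share the
# source address and source port and therefore get the same mark.
a_query = hmark("10.0.0.5", 40000, mod=2, offset=10, rnd=0xDEAFBEEF)
aaaa_query = hmark("10.0.0.5", 40000, mod=2, offset=10, rnd=0xDEAFBEEF)
same_mark = a_query == aaaa_query

# With a modulus of 2 and an offset of 10, the only possible marks are
# 10 and 11, one per DNAT rule.
marks = {hmark("10.0.0.5", p, 2, 10, 0xDEAFBEEF) for p in range(1024, 2048)}
```

Note that a modulus of 1 would always yield the offset itself, so the
modulus has to equal the number of DNAT rules being selected between.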

Best regards,
Jozsef
-
E-mail  : kadlec@xxxxxxxxxxxxxxxxx, kadlecsik.jozsef@xxxxxxxxxxxxx
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics
          H-1525 Budapest 114, POB. 49, Hungary