Re: Most optimal method to dump UDP conntrack entries

Antonio Ojea <antonio.ojea.garcia@xxxxxxxxx> wrote:
> In the context of Kubernetes, when DNATing entries for UDP Services,
> we need to deal with some edge cases where some UDP entries are left
> orphaned but blackhole the traffic to the new endpoints.
> 
> At a high level, the scenario is:
> - Client IP_A sends UDP traffic to VirtualIP IP_B --> Kubernetes
> Translates this to Endpoint IP_C
> - Endpoint IP_C is replaced by Endpoint IP_D, but since Client IP_A
> does not stop sending traffic, the conntrack entry IP_A IP_B --> IP_C
> takes precedence and is being renewed, so traffic is not sent to the
> new Endpoint IP_D and is lost.
> 
> To solve this problem, we have some heuristics to detect those
> scenarios when the endpoints change and flush the conntrack entries.
> However, since this is event-based, if we lose the event that
> triggered the problem, or something else fails to clean up the
> entry, the user needs to flush the entries manually.
> 
> We are implementing a new approach to solve this: we list all the UDP
> conntrack entries using netlink, compare them against the programmed
> nftables/iptables rules, and flush the ones we know are stale.
> 
> During the implementation review, the question [1] this raises is, how
> impactful is it to dump all the conntrack entries each time we program
> the iptables/nftables rules (this can be every 1s on nodes with a lot
> of entries)?
> Is this approach completely safe?
> Should we try to read from procfs instead?

Walking all conntrack entries in 1s intervals is going to be slow, no
matter the chosen interface.  Even doing the filtering in the kernel to
not dump all entries but only those that match udp/port/ip criteria is
not going to change it.
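For reference, the poll-and-compare step described in the quoted mail
amounts to roughly the sketch below.  It runs against canned
"conntrack -L -p udp" output so it is reproducible without root; in
real use the entries would come from the live dump and the expected
list from the programmed rules.  All addresses and the expected
backend list are hypothetical.

```shell
#!/bin/sh
# Sketch of the poll-based stale-entry detection, using canned
# conntrack output instead of a live dump.  All IPs hypothetical.

expected="10.0.0.4"   # backend(s) still programmed in the rules

# Two UDP entries: one to the live backend 10.0.0.4, one orphaned
# entry still pointing at the replaced endpoint 10.0.0.3.
entries='udp 17 28 src=192.168.1.10 dst=10.96.0.1 sport=40000 dport=53 src=10.0.0.4 dst=192.168.1.10 sport=53 dport=40000
udp 17 28 src=192.168.1.10 dst=10.96.0.1 sport=40001 dport=53 src=10.0.0.3 dst=192.168.1.10 sport=53 dport=40001'

# The last src= field on each line is the reply source, i.e. the
# backend the DNAT entry currently maps to.
stale=$(printf '%s\n' "$entries" | awk -v ok="$expected" '
    { for (i = 1; i <= NF; i++) if ($i ~ /^src=/) backend = substr($i, 5)
      if (index(" " ok " ", " " backend " ") == 0) print backend }')

for b in $stale; do
    echo "stale backend $b: conntrack -D -p udp --reply-src $b"
done
```

The full-table walk hidden behind the dump is exactly the part that
stays expensive no matter how the comparison itself is done.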

Also, both proc and netlink dumps can miss entries (albeit rarely)
if parallel insertions/deletions happen, which is normal on a busy
system.

I wonder why the appropriate delete requests cannot be issued when the
mapping is altered; you must have some code that issues either
"iptables -t nat -D ..." or "nft delete element ..." or similar.

If you do that, why not also fire off the conntrack -D request
afterwards?  Or are these publish/withdraw so frequent that this
doesn't matter compared to poll based approach?

Something like
   conntrack -D --protonum 17 --orig-dst $vserver --orig-port-dst 53 --reply-src $rserver --reply-port-src 5353

would zap every entry towards $rserver that is mapped to $vserver from
the client's point of view.

Granted, this isn't great either, but you would not have to poll
all the time.
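Concretely, the event-driven alternative could look like the sketch
below.  The commands are built and echoed rather than executed, and
the set name "svc_backends", the variable names, and all addresses are
hypothetical placeholders for whatever the ruleset actually programs.

```shell
#!/bin/sh
# Sketch of event-driven cleanup: when endpoint IP_C is replaced for
# VirtualIP IP_B, withdraw the mapping and immediately delete the
# conntrack entries still pointing at the old endpoint.  Commands are
# echoed, not executed; set name and addresses are hypothetical.
VIP="10.96.0.1"    # IP_B, the service VirtualIP
OLD="10.0.0.3"     # IP_C, the endpoint being withdrawn
PORT=53

# 1. withdraw the mapping (whatever form the ruleset uses)
nft_cmd="nft delete element ip nat svc_backends { $VIP . $PORT : $OLD }"
# 2. zap the stale entries so renewed UDP traffic gets re-DNATed
ct_cmd="conntrack -D -p udp --orig-dst $VIP --orig-port-dst $PORT --reply-src $OLD"

echo "$nft_cmd"
echo "$ct_cmd"
```

Running the conntrack delete right after the rule update keeps the
cleanup tied to the event that made the entry stale, instead of
rediscovering it on every poll.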

Is this only a problem for UDP?  I wonder if we should change UDP
conntrack to no longer refresh the timeout for original-direction
packets when the connection is subject to NAT; that would make such
entries expire in the given 'dnat mapping went away and client tries
to talk to unreachable host' scenario.



