Optimal method to dump UDP conntrack entries

Hi,

In the context of Kubernetes, when DNATing entries for UDP Services,
we need to deal with some edge cases where UDP conntrack entries are
left orphaned and blackhole the traffic meant for the new endpoints.

At a high level, the scenario is:
- Client IP_A sends UDP traffic to VirtualIP IP_B --> Kubernetes
translates this to Endpoint IP_C.
- Endpoint IP_C is replaced by Endpoint IP_D, but since Client IP_A
does not stop sending traffic, the conntrack entry IP_A IP_B --> IP_C
takes precedence and keeps being renewed, so traffic is never sent to
the new Endpoint IP_D and is lost.
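
On the node, the offending entry looks roughly like this in the
output of conntrack -L (timeout and ports are made up for
illustration); note the reply tuple still points at the old
Endpoint IP_C:

  udp      17 117 src=IP_A dst=IP_B sport=50000 dport=53 src=IP_C dst=IP_A sport=53 dport=50000 [ASSURED] mark=0 use=1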

To solve this problem, we have some heuristics to detect those
scenarios when the endpoints change and flush the conntrack entries.
However, since this is event based, if we lose the event that
triggered the problem, or something happens that prevents cleaning up
the entry, the user needs to flush the entries manually.
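
With conntrack-tools, flushing the entry from the example above by
hand means running something like:

  conntrack -D -p udp --orig-dst IP_B --reply-src IP_C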

We are implementing a new approach to solve this: we list all the UDP
conntrack entries using netlink, compare them against the currently
programmed nftables/iptables rules, and flush the ones we know are
stale.
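
The sketch below shows the shape of the logic, assuming the
github.com/vishvananda/netlink package; activeEndpoints is a
hypothetical, precomputed set of the endpoint IPs currently
programmed in the rules, standing in for the real comparison against
nftables/iptables:

package main

import (
	"fmt"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

// staleUDPFilter matches UDP flows whose reply source (the DNATed
// endpoint) is no longer one of the programmed endpoints.
type staleUDPFilter struct {
	activeEndpoints map[string]bool
}

func (f *staleUDPFilter) MatchConntrackFlow(flow *netlink.ConntrackFlow) bool {
	if flow.Forward.Protocol != unix.IPPROTO_UDP {
		return false
	}
	// For a DNATed flow, the reply tuple's source is the real endpoint.
	return !f.activeEndpoints[flow.Reverse.SrcIP.String()]
}

func flushStaleUDPEntries(activeEndpoints map[string]bool) error {
	filter := &staleUDPFilter{activeEndpoints: activeEndpoints}
	// ConntrackDeleteFilter dumps the table and deletes matching flows.
	n, err := netlink.ConntrackDeleteFilter(netlink.ConntrackTable, unix.AF_INET, filter)
	if err != nil {
		return err
	}
	fmt.Printf("flushed %d stale UDP entries\n", n)
	return nil
}

func main() {
	// Hypothetical: IP_D (here 10.0.0.4) is the only valid endpoint.
	if err := flushStaleUDPEntries(map[string]bool{"10.0.0.4": true}); err != nil {
		fmt.Println(err)
	}
}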

During the implementation review, the question [1] this raised is:
how impactful is it to dump all the conntrack entries each time we
program the iptables/nftables rules (this can be as often as every
second on nodes with a lot of entries)?
Is this approach completely safe?
Should we try to read from procfs instead?
Any other suggestions?
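
For reference, the procfs alternative we would be comparing against
looks roughly like the sketch below (assuming the kernel is built
with CONFIG_NF_CONNTRACK_PROCFS so that /proc/net/nf_conntrack
exists), with the same filtering against the programmed rules done on
the parsed lines:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// listUDPEntriesFromProcfs returns the raw UDP lines from
// /proc/net/nf_conntrack; each line carries both tuples, e.g.:
// ipv4 2 udp 17 117 src=IP_A dst=IP_B sport=... dport=... src=IP_C dst=IP_A ...
func listUDPEntriesFromProcfs() ([]string, error) {
	f, err := os.Open("/proc/net/nf_conntrack")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var entries []string
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, " udp ") {
			entries = append(entries, line)
		}
	}
	return entries, scanner.Err()
}

func main() {
	entries, err := listUDPEntriesFromProcfs()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%d UDP conntrack entries\n", len(entries))
}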

Thanks,
A.Ojea


[1]: https://github.com/kubernetes/kubernetes/pull/127318#discussion_r1756967553



