On Thu, Oct 17, 2024 at 02:46:32PM +0200, Florian Westphal wrote:
> Antonio Ojea <antonio.ojea.garcia@xxxxxxxxx> wrote:
> > In the context of Kubernetes, when DNATing entries for UDP Services,
> > we need to deal with some edge cases where some UDP entries are left
> > orphaned but blackhole the traffic to the new endpoints.
> >
> > At a high level, the scenario is:
> > - Client IP_A sends UDP traffic to VirtualIP IP_B --> Kubernetes
> >   translates this to Endpoint IP_C
> > - Endpoint IP_C is replaced by Endpoint IP_D, but since Client IP_A
> >   does not stop sending traffic, the conntrack entry IP_A IP_B --> IP_C
> >   takes precedence and is being renewed, so traffic is not sent to the
> >   new Endpoint IP_D and is lost.
> >
> > To solve this problem, we have some heuristics to detect those
> > scenarios when the endpoints change and flush the conntrack entries;
> > however, since this is event-based, if we lose the event that
> > triggered the problem, or something happens that fails to clean up
> > the entry, the user needs to manually flush the entries.
> >
> > We are implementing a new approach to solve this: we list all the UDP
> > conntrack entries using netlink, compare them against the existing
> > programmed nftables/iptables rules, and flush the ones we know are
> > stale.
> >
> > During the implementation review, the question [1] this raises is:
> > how impactful is it to dump all the conntrack entries each time we
> > program the iptables/nftables rules (this can be every 1s on nodes
> > with a lot of entries)?
> > Is this approach completely safe?
> > Should we try to read from procfs instead?
>
> Walking all conntrack entries at 1s intervals is going to be slow, no
> matter the chosen interface. Even doing the filtering in the kernel to
> not dump all entries but only those that match udp/port/ip criteria is
> not going to change that.
>
> Also, both proc and netlink dumps can miss entries (albeit it's rare)
> if parallel insertions/deletions happen (which is normal on a busy
> system).
>
> I wonder why the appropriate delete requests cannot be done when the
> mapping is altered; I mean, you must have some code that issues
> either iptables -t nat -D ... or nft delete element ... or similar.
>
> If you do that, why not also fire off the conntrack -D request
> afterwards? Or are these publish/withdraw events so frequent that this
> doesn't matter compared to a poll-based approach?
>
> Something like
> conntrack -D --protonum 17 --orig-dst $vserver --orig-port-dst 53 --reply-src $rserver --reply-port-src 5353
>
> would zap everything to $rserver mapped to $vserver from the client's
> point of view.

This reminds me, it would be good to expand the conntrack utility to use
the new kernel API to filter from the kernel + delete. I will try to get
to this.
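
For reference, here is a rough, untested sketch of the "dump over netlink
and compare" approach described above, using the vishvananda/netlink Go
library. The VIP, port and set of current endpoints are placeholders for
whatever the proxy has actually programmed in iptables/nftables:

// Rough sketch, untested: dump the conntrack table over netlink and
// report the UDP flows whose original destination is the service VIP
// but whose reply source is no longer one of the current endpoints.
package main

import (
    "fmt"
    "net"

    "github.com/vishvananda/netlink"
    "golang.org/x/sys/unix"
)

// staleUDPFlows returns conntrack entries that still DNAT vip:port to a
// backend which is no longer in currentEndpoints.
func staleUDPFlows(vip net.IP, port uint16, currentEndpoints map[string]bool) ([]*netlink.ConntrackFlow, error) {
    flows, err := netlink.ConntrackTableList(netlink.ConntrackTable, unix.AF_INET)
    if err != nil {
        return nil, err
    }
    var stale []*netlink.ConntrackFlow
    for _, f := range flows {
        if f.Forward.Protocol != unix.IPPROTO_UDP {
            continue
        }
        // Original tuple still points at the VIP...
        if !f.Forward.DstIP.Equal(vip) || f.Forward.DstPort != port {
            continue
        }
        // ...but the DNAT'd backend is no longer a valid endpoint.
        if !currentEndpoints[f.Reverse.SrcIP.String()] {
            stale = append(stale, f)
        }
    }
    return stale, nil
}

func main() {
    // Placeholder values: IP_B is the VIP, IP_D the only remaining endpoint.
    endpoints := map[string]bool{"10.0.0.4": true}
    stale, err := staleUDPFlows(net.ParseIP("10.96.0.10"), 53, endpoints)
    if err != nil {
        panic(err)
    }
    for _, f := range stale {
        fmt.Printf("stale: %s:%d -> %s:%d answered by %s\n",
            f.Forward.SrcIP, f.Forward.SrcPort, f.Forward.DstIP, f.Forward.DstPort, f.Reverse.SrcIP)
    }
}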
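And a similar sketch of the targeted delete suggested above, i.e. roughly
the netlink equivalent of the conntrack -D command, fired right after the
iptables/nftables mapping to the old endpoint is removed. Again the names
are placeholders, and a recent version of the library that provides
ConntrackDeleteFilters is assumed:

// Rough sketch, untested: roughly the netlink equivalent of
//   conntrack -D --protonum 17 --orig-dst $vserver --orig-port-dst 53 --reply-src $rserver
// intended to run right after the mapping to $rserver is withdrawn.
package main

import (
    "fmt"
    "net"

    "github.com/vishvananda/netlink"
    "golang.org/x/sys/unix"
)

// deleteStaleEntries removes UDP conntrack entries that were DNATed from
// vip:vipPort to oldEndpoint. All parameter names are placeholders.
func deleteStaleEntries(vip net.IP, vipPort uint16, oldEndpoint net.IP) error {
    filter := &netlink.ConntrackFilter{}
    // The protocol has to be set before port filters can be added.
    if err := filter.AddProtocol(unix.IPPROTO_UDP); err != nil {
        return err
    }
    if err := filter.AddIP(netlink.ConntrackOrigDstIP, vip); err != nil {
        return err
    }
    if err := filter.AddPort(netlink.ConntrackOrigDstPort, vipPort); err != nil {
        return err
    }
    if err := filter.AddIP(netlink.ConntrackReplySrcIP, oldEndpoint); err != nil {
        return err
    }
    deleted, err := netlink.ConntrackDeleteFilters(netlink.ConntrackTable, unix.AF_INET, filter)
    if err != nil {
        return err
    }
    fmt.Printf("deleted %d stale UDP entries for %s -> %s\n", deleted, vip, oldEndpoint)
    return nil
}

func main() {
    // Placeholder values: IP_B is the VIP, IP_C the endpoint that was removed.
    if err := deleteStaleEntries(net.ParseIP("10.96.0.10"), 53, net.ParseIP("10.0.0.3")); err != nil {
        panic(err)
    }
}

Matching on orig-dst/orig-port-dst plus reply-src keeps the delete scoped
to flows that were DNATed from that VIP to the removed endpoint, so
unrelated UDP entries are left alone.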