On Thu, 17 Oct 2024 at 17:36, Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> wrote:
>
> On Thu, Oct 17, 2024 at 02:46:32PM +0200, Florian Westphal wrote:
> > Antonio Ojea <antonio.ojea.garcia@xxxxxxxxx> wrote:
> > > In the context of Kubernetes, when DNATing entries for UDP Services,
> > > we need to deal with some edge cases where some UDP entries are left
> > > orphaned but keep blackholing the traffic meant for the new endpoints.
> > >
> > > At a high level, the scenario is:
> > > - Client IP_A sends UDP traffic to VirtualIP IP_B --> Kubernetes
> > >   translates this to Endpoint IP_C
> > > - Endpoint IP_C is replaced by Endpoint IP_D, but since Client IP_A
> > >   does not stop sending traffic, the conntrack entry IP_A IP_B --> IP_C
> > >   takes precedence and keeps being renewed, so traffic is not sent to
> > >   the new Endpoint IP_D and is lost.
> > >
> > > To solve this problem, we have some heuristics to detect those
> > > scenarios when the endpoints change and flush the conntrack entries.
> > > However, since this is event based, if we lose the event that
> > > triggered the problem, or something else fails to clean up the
> > > entry, the user needs to flush the entries manually.
> > >
> > > We are implementing a new approach to solve this: we list all the UDP
> > > conntrack entries using netlink, compare them against the
> > > nftables/iptables rules we have programmed, and flush the ones we
> > > know are stale.
> > >
> > > During the implementation review, the question [1] this raises is: how
> > > impactful is it to dump all the conntrack entries each time we program
> > > the iptables/nftables rules (this can be every 1s on nodes with a lot
> > > of entries)?
> > > Is this approach completely safe?
> > > Should we try to read from procfs instead?
> >
> > Walking all conntrack entries in 1s intervals is going to be slow, no
> > matter the chosen interface. Even doing the filtering in the kernel to
> > not dump all entries but only those that match udp/port/ip criteria is
> > not going to change it.

We are not worried about being slow on the order of seconds; the system
is eventually consistent, so there can always be a reasonable latency.
Since we only care about UDP, losing packets during that period is not
desirable, but it is acceptable.

My main concern is whether constantly dumping all the entries via
netlink can cause any problems or increase resource consumption.

> > Also both proc and netlink dumps can miss entries (albeit its rare),
> > if parallel insertions/deletes happen (which is normal on busy system).

That is one of the reasons we want to implement this reconcile loop, so
that it is resilient to this kind of error: we keep the state in the API
in the control plane, so we can always rebuild the state in the
dataplane (recreating the nftables rules and deleting the conntrack
entries that do not match the current state).

> > I wonder why the appropriate delete requests cannot be done when the
> > mapping is altered, I mean, you must have some code that issues
> > either iptables -t nat -D ... or nft delete element ... or similar.
> >
> > If you do that, why not also fire off the conntrack -D request
> > afterwards? Or are these publish/withdraw so frequent that this
> > doesn't matter compared to poll based approach?
> >
> > Something like
> > conntrack -D --protonum 17 --orig-dst $vserver --orig-port-dst 53 --reply-src $rserver --reply-port-src 5353
> >
> > would zap everything to $rserver mapped to $vserver from client point of view.
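
For reference, a rough sketch of what that targeted delete could look
like from Go, assuming the github.com/vishvananda/netlink package. This
is only an illustration under that assumption, not the actual kube-proxy
code; the addresses are made up and the exact API names may vary between
library versions.

// Sketch only: a programmatic equivalent of the conntrack -D line above,
// assuming the github.com/vishvananda/netlink package. Not the actual
// kube-proxy code; addresses are hypothetical.
package main

import (
	"fmt"
	"net"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

// staleTupleFilter matches UDP flows whose original destination is the
// virtual IP/port and whose reply source is the endpoint being removed.
type staleTupleFilter struct {
	vip, endpoint   net.IP
	vipPort, epPort uint16
}

func (f staleTupleFilter) MatchConntrackFlow(flow *netlink.ConntrackFlow) bool {
	return flow.Forward.Protocol == unix.IPPROTO_UDP &&
		flow.Forward.DstIP.Equal(f.vip) &&
		flow.Forward.DstPort == f.vipPort &&
		flow.Reverse.SrcIP.Equal(f.endpoint) &&
		flow.Reverse.SrcPort == f.epPort
}

func main() {
	filter := staleTupleFilter{
		vip:      net.ParseIP("10.96.0.10"), // $vserver (hypothetical)
		vipPort:  53,
		endpoint: net.ParseIP("10.244.1.5"), // $rserver (hypothetical)
		epPort:   5353,
	}
	// Walks the conntrack table over netlink and deletes every flow the
	// filter matches, i.e. the same selection as the conntrack -D line.
	n, err := netlink.ConntrackDeleteFilter(netlink.ConntrackTable, netlink.FAMILY_V4, filter)
	if err != nil {
		fmt.Println("conntrack delete failed:", err)
		return
	}
	fmt.Printf("deleted %d stale entries\n", n)
}

The filter mirrors the tuple selection of the conntrack -D invocation:
original destination = virtual IP/port, reply source = old endpoint.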

Firing off the conntrack delete when the mapping changes is how it is
implemented today, and it works, but it does not handle process
restarts, for example, and it is not resilient to errors. The
implementation is also much more complex, because we need to cover all
the possible edge cases that can leave stale entries.

> This reminds me, it would be good to expand conntrack utility to use
> the new kernel API to filter from kernel + delete.
>
> I will try to get here.

I should bring more of these problems to the mailing list :-)
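
For completeness, here is a minimal sketch of the list-and-compare
reconcile pass described at the top of the thread, again assuming the
github.com/vishvananda/netlink package. It is illustrative only, not the
actual kube-proxy code; the programmed-state map and the addresses are
invented for the example.

package main

import (
	"fmt"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

// programmed maps Service VIP -> the endpoint IPs currently written into
// the nftables/iptables rules. Hypothetical data; in the real proxy this
// would come from the state that was just programmed.
var programmed = map[string]map[string]bool{
	"10.96.0.10": {"10.244.1.7": true},
}

// isStale: UDP traffic to a known VIP whose DNAT target (reply source)
// is no longer one of the Service's current endpoints.
func isStale(f *netlink.ConntrackFlow) bool {
	if f.Forward.Protocol != unix.IPPROTO_UDP {
		return false
	}
	endpoints, isVIP := programmed[f.Forward.DstIP.String()]
	return isVIP && !endpoints[f.Reverse.SrcIP.String()]
}

// key identifies a flow by its original-direction tuple.
func key(f *netlink.ConntrackFlow) string {
	return fmt.Sprintf("%s:%d->%s:%d/%d",
		f.Forward.SrcIP, f.Forward.SrcPort, f.Forward.DstIP, f.Forward.DstPort, f.Forward.Protocol)
}

// tupleSetFilter deletes exactly the flows collected in the compare step.
type tupleSetFilter map[string]bool

func (s tupleSetFilter) MatchConntrackFlow(f *netlink.ConntrackFlow) bool { return s[key(f)] }

func main() {
	// Step 1: dump the conntrack table over netlink (the walk whose cost
	// the thread is asking about).
	flows, err := netlink.ConntrackTableList(netlink.ConntrackTable, netlink.FAMILY_V4)
	if err != nil {
		fmt.Println("conntrack dump failed:", err)
		return
	}

	// Step 2: compare against the programmed state.
	stale := tupleSetFilter{}
	for _, f := range flows {
		if isStale(f) {
			stale[key(f)] = true
		}
	}
	if len(stale) == 0 {
		return
	}

	// Step 3: flush the stale entries, expressed here as a filtered
	// delete over the tuples collected above.
	n, err := netlink.ConntrackDeleteFilter(netlink.ConntrackTable, netlink.FAMILY_V4, stale)
	if err != nil {
		fmt.Println("conntrack delete failed:", err)
		return
	}
	fmt.Printf("flushed %d stale UDP entries\n", n)
}

Step 1 is the full-table dump whose cost is under discussion; as far as
I understand, the filtered delete in step 3 walks the table again in
userspace, which is where kernel-side filter + delete support would
help.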