Hi Antonio,

On Thu, Oct 17, 2024 at 11:10:02PM +0100, Antonio Ojea wrote:
> On Thu, 17 Oct 2024 at 17:36, Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> wrote:
> >
> > On Thu, Oct 17, 2024 at 02:46:32PM +0200, Florian Westphal wrote:
> > > Antonio Ojea <antonio.ojea.garcia@xxxxxxxxx> wrote:
> > > > In the context of Kubernetes, when DNATing entries for UDP Services,
> > > > we need to deal with some edge cases where some UDP entries are left
> > > > orphaned but blackhole the traffic to the new endpoints.
> > > >
> > > > At a high level, the scenario is:
> > > > - Client IP_A sends UDP traffic to VirtualIP IP_B --> Kubernetes
> > > >   translates this to Endpoint IP_C
> > > > - Endpoint IP_C is replaced by Endpoint IP_D, but since Client IP_A
> > > >   does not stop sending traffic, the conntrack entry IP_A IP_B --> IP_C
> > > >   takes precedence and keeps being renewed, so traffic is not sent to
> > > >   the new Endpoint IP_D and is lost.
> > > >
> > > > To solve this problem, we have some heuristics to detect those
> > > > scenarios when the endpoints change and flush the conntrack entries.
> > > > However, since this is event based, if we lose the event that
> > > > triggered the problem or something happens that fails to clean up the
> > > > entry, the user needs to manually flush the entries.

You can still stick to the event approach, then resort to a
resync/reconcile loop when userspace gets a report that events are
getting lost, i.e. a hybrid approach.

> > > > We are implementing a new approach to solve this: we list all the UDP
> > > > conntrack entries using netlink, compare against the existing
> > > > programmed nftables/iptables rules, and flush the ones we know are
> > > > stale.
> > > >
> > > > During the implementation review, the question [1] this raises is:
> > > > how impactful is it to dump all the conntrack entries each time we
> > > > program the iptables/nftables rules (this can be every 1s on nodes
> > > > with a lot of entries)?
> > > > Is this approach completely safe?
> > > > Should we try to read from procfs instead?
> > >
> > > Walking all conntrack entries in 1s intervals is going to be slow, no
> > > matter the chosen interface. Even doing the filtering in the kernel to
> > > not dump all entries but only those that match udp/port/ip criteria is
> > > not going to change it.
>
> We are not worried about being slow in the order of seconds, the
> system is eventually consistent so there can always be a reasonable
> latency.
> Since we only care about UDP, losing packets during that period is not
> desirable but is acceptable.
> My main concern is whether constantly dumping all the entries via
> netlink can cause any issue or increase resource consumption.
>
> > > Also both proc and netlink dumps can miss entries (albeit it's rare)
> > > if parallel insertions/deletes happen (which is normal on a busy
> > > system).
>
> That is one of the reasons we want to implement this reconcile loop,
> so it can be resilient to this kind of error: we keep the state in the
> API in the control plane, so we can always rebuild the state in the
> dataplane (recreating nftables rules and deleting conntrack entries
> that do not match the current state).
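To make that dump-and-compare step concrete, below is a rough sketch of
one reconcile pass. It assumes the github.com/vishvananda/netlink Go
library as the ctnetlink frontend (helper names may differ between
versions of that library), and serviceKey and the expected map are
hypothetical stand-ins for the kube-proxy bookkeeping, not its real
data structures:

// Rough sketch of one dump-and-compare pass, assuming the
// github.com/vishvananda/netlink library as the ctnetlink frontend.
// serviceKey and the expected map are hypothetical stand-ins for the
// real kube-proxy bookkeeping.
package reconcile

import (
	"fmt"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

// serviceKey identifies a UDP service frontend (virtual IP + port).
type serviceKey struct {
	VirtualIP string
	Port      uint16
}

// ReconcileUDP dumps the conntrack table and deletes UDP entries whose
// reply source (the backend the flow was DNATed to) is no longer one
// of the expected endpoints for that frontend.
func ReconcileUDP(expected map[serviceKey]map[string]bool) error {
	flows, err := netlink.ConntrackTableList(netlink.ConntrackTable, unix.AF_INET)
	if err != nil {
		return fmt.Errorf("conntrack dump: %w", err)
	}
	for _, f := range flows {
		if f.Forward.Protocol != unix.IPPROTO_UDP {
			continue
		}
		key := serviceKey{VirtualIP: f.Forward.DstIP.String(), Port: f.Forward.DstPort}
		endpoints, tracked := expected[key]
		if !tracked {
			continue // not one of our virtual IPs
		}
		if endpoints[f.Reverse.SrcIP.String()] {
			continue // still DNATed to a valid endpoint
		}
		// Stale mapping: delete only the flows for this exact
		// frontend/backend pair. Errors from the filter setters are
		// ignored for brevity.
		filter := &netlink.ConntrackFilter{}
		filter.AddProtocol(unix.IPPROTO_UDP)
		filter.AddIP(netlink.ConntrackOrigDstIP, f.Forward.DstIP)
		filter.AddPort(netlink.ConntrackOrigDstPort, f.Forward.DstPort)
		filter.AddIP(netlink.ConntrackReplySrcIP, f.Reverse.SrcIP)
		if _, err := netlink.ConntrackDeleteFilters(netlink.ConntrackTable, unix.AF_INET, filter); err != nil {
			return fmt.Errorf("conntrack delete: %w", err)
		}
	}
	return nil
}

As far as I remember, the filter-based delete in that library is itself
implemented as another dump plus per-entry delete, so collecting all
stale mappings and issuing one batched delete call keeps a reconcile at
roughly two table walks instead of one extra walk per stale entry.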
> > > I wonder why the appropriate delete requests cannot be done when the
> > > mapping is altered, I mean, you must have some code that issues
> > > either iptables -t nat -D ... or nft delete element ... or similar.
> > >
> > > If you do that, why not also fire off the conntrack -D request
> > > afterwards? Or are these publish/withdraw so frequent that this
> > > doesn't matter compared to a poll based approach?
> > >
> > > Something like
> > >
> > > conntrack -D --protonum 17 --orig-dst $vserver --orig-port-dst 53 --reply-src $rserver --reply-port-src 5353
> > >
> > > would zap everything to $rserver mapped to $vserver from the client's
> > > point of view.
>
> This is how it is implemented today and it works, but it does not
> handle process restarts, for example, and it is not resilient to
> errors.
> The implementation is also much more complex because we need to handle
> all the possible edge cases that can leave stale entries.

It should also be possible to shrink timeouts on restart via
conntrack -U, which would be similar to the approach that Florian is
proposing, but from the control plane rather than updating the existing
UDP timeout policy.
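Coming back to the hybrid idea above: the kernel already reports lost
events, since reads on the event socket fail with ENOBUFS once its
receive buffer overflows, and that error is a reasonable trigger for a
full resync (conntrackd uses the same signal for its overrun resync).
A rough sketch with plain golang.org/x/sys/unix, where event parsing is
omitted and resyncUDP is a placeholder for the reconcile pass above:

// Rough sketch of the hybrid approach: stay event driven, but treat a
// netlink buffer overrun (ENOBUFS) as "events were lost" and fall back
// to a full dump-and-compare resync. resyncUDP is a placeholder, not a
// real kube-proxy function.
package reconcile

import (
	"log"

	"golang.org/x/sys/unix"
)

func watchConntrackEvents(resyncUDP func() error) error {
	fd, err := unix.Socket(unix.AF_NETLINK, unix.SOCK_RAW, unix.NETLINK_NETFILTER)
	if err != nil {
		return err
	}
	defer unix.Close(fd)

	// Subscribe to the conntrack new/destroy event groups; group n maps
	// to bit 1<<(n-1) in the Groups bitmask.
	sa := &unix.SockaddrNetlink{
		Family: unix.AF_NETLINK,
		Groups: 1<<(unix.NFNLGRP_CONNTRACK_NEW-1) |
			1<<(unix.NFNLGRP_CONNTRACK_DESTROY-1),
	}
	if err := unix.Bind(fd, sa); err != nil {
		return err
	}

	buf := make([]byte, 64*1024)
	for {
		n, _, err := unix.Recvfrom(fd, buf, 0)
		if err == unix.ENOBUFS {
			// The kernel dropped events because we could not keep up:
			// the event stream can no longer be trusted, so run a full
			// resync instead.
			log.Print("conntrack event overrun, running full resync")
			if err := resyncUDP(); err != nil {
				return err
			}
			continue
		}
		if err != nil {
			return err
		}
		_ = buf[:n] // parse the nfnetlink event messages here (omitted)
	}
}

That keeps the common case purely event driven and only pays for a full
dump when the event stream is known to be unreliable.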