Re: Most optimal method to dump UDP conntrack entries

On Fri, 18 Oct 2024 at 00:30, Florian Westphal <fw@xxxxxxxxx> wrote:
>
> Antonio Ojea <antonio.ojea.garcia@xxxxxxxxx> wrote:
> > We are not worried about being slow on the order of seconds; the
> > system is eventually consistent, so some latency is always
> > acceptable.
> > Since we only care about UDP, losing packets during that period is
> > not desirable but is tolerable.
> > My main concern is whether constantly dumping all the entries via
> > netlink can cause any issues or increase resource consumption.
>
> It doesn't need lots of extra memory on the kernel side, just some
> internal tracking ("where were we") and a temporary netlink buffer
> to assemble the dump chunks returned to userspace.
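
As an aside for anyone who wants to try this: a minimal dump sketch
with libnetfilter_conntrack could look like the following (assuming
its standard nfct_* API; the UDP filter is applied in the callback,
since the dump request itself is only per address family):

/* Minimal conntrack dump sketch using libnetfilter_conntrack.
 * Build with: gcc dump.c -lnetfilter_conntrack
 */
#include <stdio.h>
#include <stdint.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <libnetfilter_conntrack/libnetfilter_conntrack.h>

static int dump_cb(enum nf_conntrack_msg_type type,
                   struct nf_conntrack *ct, void *data)
{
        char buf[1024];

        /* the dump returns all protocols; keep only UDP entries */
        if (nfct_get_attr_u8(ct, ATTR_L4PROTO) != IPPROTO_UDP)
                return NFCT_CB_CONTINUE;

        nfct_snprintf(buf, sizeof(buf), ct, NFCT_T_UNKNOWN,
                      NFCT_O_DEFAULT, 0);
        printf("%s\n", buf);
        return NFCT_CB_CONTINUE;
}

int main(void)
{
        struct nfct_handle *h = nfct_open(CONNTRACK, 0);
        uint32_t family = AF_INET;

        if (!h) {
                perror("nfct_open");
                return 1;
        }
        nfct_callback_register(h, NFCT_T_ALL, dump_cb, NULL);
        if (nfct_query(h, NFCT_Q_DUMP, &family) < 0)
                perror("nfct_query");
        nfct_close(h);
        return 0;
}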
>
> > > > Also both proc and netlink dumps can miss entries (albeit it's rare)
> > > > if parallel insertions/deletions happen (which is normal on a busy
> > > > system).
> >
> > That is one of the reasons we want to implement this reconcile loop,
> > so it can be resilient to this kind of error: we keep the state in
> > the API in the control plane, so we can always rebuild the state in
> > the dataplane (recreating nftables rules and deleting conntrack
> > entries that do not match the current state).
>
> OK, but wouldn't it be better to not need this at all?
>

Absolutely, that would be awesome.

> The UDP tracker is old and refreshes the timeout for every packet, but
> who says it needs to stay that way?
>
> With a very simple patch to only refresh in the 'reply' direction:
>
>     [NEW] udp      17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> [DESTROY] udp      17 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
>     [NEW] udp      17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> [DESTROY] udp      17 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
>     [NEW] udp      17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
>
> This is with a unidirectional UDP stream.
>
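
(As an aside, the events above look like `conntrack -E` output, and a
hypothetical reproducer for the unidirectional stream could be as
simple as the sketch below: nothing listens on the destination port,
so no reply is ever seen and, with the patch, the entry expires and
is recreated as shown.)

/* Hypothetical reproducer: one datagram per second to a silent port. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        struct sockaddr_in dst = {
                .sin_family = AF_INET,
                .sin_port = htons(12345),
                .sin_addr.s_addr = htonl(INADDR_LOOPBACK),
        };
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        if (fd < 0)
                return 1;
        for (;;) {
                /* same socket, so the same source port is reused */
                sendto(fd, "x", 1, 0,
                       (struct sockaddr *)&dst, sizeof(dst));
                sleep(1);
        }
}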
> diff --git a/net/netfilter/nf_conntrack_proto_udp.c b/net/netfilter/nf_conntrack_proto_udp.c
> --- a/net/netfilter/nf_conntrack_proto_udp.c
> +++ b/net/netfilter/nf_conntrack_proto_udp.c
> @@ -17,6 +17,7 @@
>  #include <linux/netfilter.h>
>  #include <linux/netfilter_ipv4.h>
>  #include <linux/netfilter_ipv6.h>
> +#include <net/netfilter/nf_conntrack_acct.h>
>  #include <net/netfilter/nf_conntrack_l4proto.h>
>  #include <net/netfilter/nf_conntrack_ecache.h>
>  #include <net/netfilter/nf_conntrack_timeout.h>
> @@ -80,6 +81,35 @@ static bool udp_error(struct sk_buff *skb,
>         return false;
>  }
>
> +static void nf_ct_refresh_udp(struct nf_conn *ct,
> +                             enum ip_conntrack_dir dir,
> +                             u32 len, u32 extra_jiffies)
> +{
> +       /* Only update if this is not a fixed timeout */
> +       if (test_bit(IPS_FIXED_TIMEOUT_BIT, &ct->status))
> +               return nf_ct_acct_update(ct, dir, len);
> +
> +       switch (dir) {
> +       case IP_CT_DIR_ORIGINAL:
> +               /* do not refresh in original dir, else stale dnat mapping
> +                * can be kept alive indefinitely.
> +                */
> +               if (nf_ct_is_confirmed(ct))
> +                       return nf_ct_acct_update(ct, dir, len);
> +               break;
> +       case IP_CT_DIR_REPLY:
> +               extra_jiffies += nfct_time_stamp;
> +               break;
> +       case IP_CT_DIR_MAX: /* silence warning wrt. unhandled enum */
> +               break;
> +       }
> +
> +       if (READ_ONCE(ct->timeout) != extra_jiffies)
> +               WRITE_ONCE(ct->timeout, extra_jiffies);
> +
> +       nf_ct_acct_update(ct, dir, len);
> +}
> +
>  /* Returns verdict for packet, and may modify conntracktype */
>  int nf_conntrack_udp_packet(struct nf_conn *ct,
>                             struct sk_buff *skb,
> @@ -114,7 +144,7 @@ int nf_conntrack_udp_packet(struct nf_conn *ct,
>                         stream = (status & IPS_ASSURED) == 0;
>                 }
>
> -               nf_ct_refresh_acct(ct, ctinfo, skb, extra);
> +               nf_ct_refresh_udp(ct, CTINFO2DIR(ctinfo), skb->len, extra);
>
>                 /* never set ASSURED for IPS_NAT_CLASH, they time out soon */
>                 if (unlikely((status & IPS_NAT_CLASH)))
> @@ -124,7 +154,7 @@ int nf_conntrack_udp_packet(struct nf_conn *ct,
>                 if (stream && !test_and_set_bit(IPS_ASSURED_BIT, &ct->status))
>                         nf_conntrack_event_cache(IPCT_ASSURED, ct);
>         } else {
> -               nf_ct_refresh_acct(ct, ctinfo, skb, timeouts[UDP_CT_UNREPLIED]);
> +               nf_ct_refresh_udp(ct, CTINFO2DIR(ctinfo), skb->len, timeouts[UDP_CT_UNREPLIED]);
>         }
>         return NF_ACCEPT;
>  }
>
> The obvious counter-argument is "but what if it's the reply side that
> is sending, and we have a stale SNAT mapping?".
>
> The solution would be to track the timestamp of the last packet seen
> per direction, and then stop refreshing the timeout if one side
> ceases sending.
>
> I think that's much better than what we have now, and there would be
> no need to actively zap UDP flows; they would time out automatically
> once one side ceases to transmit.
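
A rough sketch of how that per-direction tracking might look, layered
on nf_ct_refresh_udp() above. This is purely illustrative: struct
nf_conn has no per-direction last-seen field today, so last_seen[]
below is invented for the sake of the example.

/* Hypothetical helper: only keep extending the timeout while the
 * *other* direction has also been heard from within the timeout
 * window. last_seen[] stands in for whatever per-direction
 * timestamp would have to be added to struct nf_conn.
 */
static bool udp_other_dir_active(const struct nf_conn *ct,
                                 enum ip_conntrack_dir dir,
                                 u32 timeout)
{
        u32 last = READ_ONCE(ct->last_seen[!dir]);      /* invented */

        /* other side silent for longer than the timeout window:
         * stop refreshing and let the entry expire on its own
         */
        return nfct_time_stamp - last < timeout;
}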

If I understand correctly, this will have some visible effects: before
this change, packets with the same tuple would always hit the same
destination because the conntrack entry would prevail; after this
patch, packets may be spread randomly across the backends once the
entry expires.
I'm aware of some requests for this behavior in kubernetes for UDP
services, but I'm worried that some people may be relying on the
current behavior ... and about all the possible edge cases this will
open: what happens if we have two UDP packets in one direction before
the first reply is seen?



