Re: Most optimal method to dump UDP conntrack entries

Florian Westphal <fw@xxxxxxxxx> · Fri, 18 Oct 2024 13:33:18 +0200

Antonio Ojea <antonio.ojea.garcia@xxxxxxxxx> wrote:
> > The UDP tracker is old and refreshs the timeout for every packet, but
> > who says it needs to stay that way?
> >
> > With very simple patch to only refresh in 'reply' dir:
> >
> >     [NEW] udp      17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> > [DESTROY] udp      17 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> >     [NEW] udp      17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> > [DESTROY] udp      17 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> >     [NEW] udp      17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> >
> > This is with a unidirectional UDP stream.
> >
> > diff --git a/net/netfilter/nf_conntrack_proto_udp.c b/net/netfilter/nf_conntrack_proto_udp.c
> > --- a/net/netfilter/nf_conntrack_proto_udp.c
> > +++ b/net/netfilter/nf_conntrack_proto_udp.c
> > @@ -17,6 +17,7 @@
> >  #include <linux/netfilter.h>
> >  #include <linux/netfilter_ipv4.h>
> >  #include <linux/netfilter_ipv6.h>
> > +#include <net/netfilter/nf_conntrack_acct.h>
> >  #include <net/netfilter/nf_conntrack_l4proto.h>
> >  #include <net/netfilter/nf_conntrack_ecache.h>
> >  #include <net/netfilter/nf_conntrack_timeout.h>
> > @@ -80,6 +81,35 @@ static bool udp_error(struct sk_buff *skb,
> >         return false;
> >  }
> >
> > +static void nf_ct_refresh_udp(struct nf_conn *ct,
> > +                             enum ip_conntrack_dir dir,
> > +                             u32 len, u32 extra_jiffies)
> > +{
> > +       /* Only update if this is not a fixed timeout */
> > +       if (test_bit(IPS_FIXED_TIMEOUT_BIT, &ct->status))
> > +               return nf_ct_acct_update(ct, dir, len);
> > +
> > +       switch (dir) {
> > +       case IP_CT_DIR_ORIGINAL:
> > +               /* do not refresh in original dir, else stale dnat mapping
> > +                * can be kept alive indefinitely.
> > +                */
> > +               if (nf_ct_is_confirmed(ct))
> > +                       return nf_ct_acct_update(ct, dir, len);
> > +               break;
> > +       case IP_CT_DIR_REPLY:
> > +               extra_jiffies += nfct_time_stamp;
> > +               break;
> > +       case IP_CT_DIR_MAX: /* silence warning wrt. unhandled enum */
> > +               break;
> > +       }
> > +
> > +       if (READ_ONCE(ct->timeout) != extra_jiffies)
> > +               WRITE_ONCE(ct->timeout, extra_jiffies);
> > +
> > +       nf_ct_acct_update(ct, dir, len);
> > +}
> > +
> >  /* Returns verdict for packet, and may modify conntracktype */
> >  int nf_conntrack_udp_packet(struct nf_conn *ct,
> >                             struct sk_buff *skb,
> > @@ -114,7 +144,7 @@ int nf_conntrack_udp_packet(struct nf_conn *ct,
> >                         stream = (status & IPS_ASSURED) == 0;
> >                 }
> >
> > -               nf_ct_refresh_acct(ct, ctinfo, skb, extra);
> > +               nf_ct_refresh_udp(ct, CTINFO2DIR(ctinfo), skb->len, extra);
> >
> >                 /* never set ASSURED for IPS_NAT_CLASH, they time out soon */
> >                 if (unlikely((status & IPS_NAT_CLASH)))
> > @@ -124,7 +154,7 @@ int nf_conntrack_udp_packet(struct nf_conn *ct,
> >                 if (stream && !test_and_set_bit(IPS_ASSURED_BIT, &ct->status))
> >                         nf_conntrack_event_cache(IPCT_ASSURED, ct);
> >         } else {
> > -               nf_ct_refresh_acct(ct, ctinfo, skb, timeouts[UDP_CT_UNREPLIED]);
> > +               nf_ct_refresh_udp(ct, CTINFO2DIR(ctinfo), skb->len, timeouts[UDP_CT_UNREPLIED]);
> >         }
> >         return NF_ACCEPT;
> >  }
> >
> > The obvious counter-argument is "but what if its the reply side that is
> > sending, and we have a stale SNAT mapping?".
> >
> > Solution would be to track timestamp of last packet seen per direction,
> > and then stop refreshing the timeout if one side ceases sending.
> >
> > I think thats much better than what we have now and there would be no
> > need to actively zap UDP flows, they would timeout automatically once
> > one side ceases to transmit.
> 
> If I understand correctly this will have some visible effects, before
> this change the same tuple will only hit the same destination because
> the conntrack entry will prevail, and after this patch the packet will
> be spread out randomly to the number of backends.

Nothing changes unless flow is 100% uni-directional (no
return traffic OR only reply traffic after initial packet).

Old behavior:

Packet	 Direction
     1   Origin			# create entry with 30s timeout, save current time
     2   Reply			# flag entry assured (cannot be fast-removed
				# if conntrack table is full), refresh timeout to 30s
     3  Origin			# refresh timeout to 30s OR 2 Minutes, if current time is 2s in the past
     4  Origin			# refresh timeout to 30s OR 2 Minutes, if current time is 2s in the past
     ...
     5  Origin			# refresh timeout to 30s OR 2 Minutes, if current time is 2s in the past

Thus, once DNAT mapping changes, conntrack entry remains alive for as
long as origin sends more packets.

After change:

Packet	 Direction
     1   Origin			# create entry with 30s timeout, save current time
     2   Reply			# flag entry assured (cannot be fast-removed
				# if conntrack table is full), refresh timeout to 30s
     3  Origin			# do not refresh timeout
     4  Origin			# do not refresh timeout
     ...
     5  Origin			# do not refresh timeout

Once 30s have elapsed, this entry will expire, even if packets are seen.
Next packet will create a new conntrack, and a new NAT mapping could be
chosen.

If there is reply traffic, nothing changes:

Packet	 Direction
     1   Origin			# create entry with 30s timeout, save current time
     2   Reply			# flag entry assured (cannot be fast-removed
				# if conntrack table is full), refresh timeout to 30s
     3  Origin			# do not refresh timeout
     4  Reply			# refresh timeout to 30s OR 2 Minutes, if current time is 2s in the past
     ...
     5  Origin			# do not refresh timeout

Thats the general idea anyway.
I'll try to make this more robust and also let it timeout if we have
only reply traffic without any origin packets.

> I'm aware of some requests to have this behavior in kubernetes for UDP
> service, but I'm scared that there can be some people relying on this
> behavior ... and all the possible edge cases this will open, what
> happens if we have two UDP packets in one direction before the first
> reply is seen?

Same as what happens now, 2nd packet follows NAT mapping of first one.