On Fri, 18 Oct 2024 at 13:33, Florian Westphal <fw@xxxxxxxxx> wrote:
>
> Antonio Ojea <antonio.ojea.garcia@xxxxxxxxx> wrote:
> > > The UDP tracker is old and refreshes the timeout for every packet, but
> > > who says it needs to stay that way?
> > >
> > > With a very simple patch to only refresh in the 'reply' dir:
> > >
> > >     [NEW] udp 17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> > > [DESTROY] udp 17 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> > >     [NEW] udp 17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> > > [DESTROY] udp 17 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> > >     [NEW] udp 17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> > >
> > > This is with a unidirectional UDP stream.
> > >
> > > diff --git a/net/netfilter/nf_conntrack_proto_udp.c b/net/netfilter/nf_conntrack_proto_udp.c
> > > --- a/net/netfilter/nf_conntrack_proto_udp.c
> > > +++ b/net/netfilter/nf_conntrack_proto_udp.c
> > > @@ -17,6 +17,7 @@
> > >  #include <linux/netfilter.h>
> > >  #include <linux/netfilter_ipv4.h>
> > >  #include <linux/netfilter_ipv6.h>
> > > +#include <net/netfilter/nf_conntrack_acct.h>
> > >  #include <net/netfilter/nf_conntrack_l4proto.h>
> > >  #include <net/netfilter/nf_conntrack_ecache.h>
> > >  #include <net/netfilter/nf_conntrack_timeout.h>
> > > @@ -80,6 +81,35 @@ static bool udp_error(struct sk_buff *skb,
> > >  	return false;
> > >  }
> > >
> > > +static void nf_ct_refresh_udp(struct nf_conn *ct,
> > > +			      enum ip_conntrack_dir dir,
> > > +			      u32 len, u32 extra_jiffies)
> > > +{
> > > +	/* Only update if this is not a fixed timeout */
> > > +	if (test_bit(IPS_FIXED_TIMEOUT_BIT, &ct->status))
> > > +		return nf_ct_acct_update(ct, dir, len);
> > > +
> > > +	switch (dir) {
> > > +	case IP_CT_DIR_ORIGINAL:
> > > +		/* do not refresh in original dir, else stale dnat mapping
> > > +		 * can be kept alive indefinitely.
> > > +		 */
> > > +		if (nf_ct_is_confirmed(ct))
> > > +			return nf_ct_acct_update(ct, dir, len);
> > > +		break;
> > > +	case IP_CT_DIR_REPLY:
> > > +		extra_jiffies += nfct_time_stamp;
> > > +		break;
> > > +	case IP_CT_DIR_MAX: /* silence warning wrt. unhandled enum */
> > > +		break;
> > > +	}
> > > +
> > > +	if (READ_ONCE(ct->timeout) != extra_jiffies)
> > > +		WRITE_ONCE(ct->timeout, extra_jiffies);
> > > +
> > > +	nf_ct_acct_update(ct, dir, len);
> > > +}
> > > +
> > >  /* Returns verdict for packet, and may modify conntracktype */
> > >  int nf_conntrack_udp_packet(struct nf_conn *ct,
> > >  			    struct sk_buff *skb,
> > > @@ -114,7 +144,7 @@ int nf_conntrack_udp_packet(struct nf_conn *ct,
> > >  		stream = (status & IPS_ASSURED) == 0;
> > >  	}
> > >
> > > -	nf_ct_refresh_acct(ct, ctinfo, skb, extra);
> > > +	nf_ct_refresh_udp(ct, CTINFO2DIR(ctinfo), skb->len, extra);
> > >
> > >  	/* never set ASSURED for IPS_NAT_CLASH, they time out soon */
> > >  	if (unlikely((status & IPS_NAT_CLASH)))
> > > @@ -124,7 +154,7 @@ int nf_conntrack_udp_packet(struct nf_conn *ct,
> > >  		if (stream && !test_and_set_bit(IPS_ASSURED_BIT, &ct->status))
> > >  			nf_conntrack_event_cache(IPCT_ASSURED, ct);
> > >  	} else {
> > > -		nf_ct_refresh_acct(ct, ctinfo, skb, timeouts[UDP_CT_UNREPLIED]);
> > > +		nf_ct_refresh_udp(ct, CTINFO2DIR(ctinfo), skb->len, timeouts[UDP_CT_UNREPLIED]);
> > >  	}
> > >  	return NF_ACCEPT;
> > >  }
> > >
> > > The obvious counter-argument is "but what if it's the reply side that is
> > > sending, and we have a stale SNAT mapping?".
> > >
> > > The solution would be to track the timestamp of the last packet seen per
> > > direction, and then stop refreshing the timeout if one side ceases sending.
> > >
> > > I think that's much better than what we have now and there would be no
> > > need to actively zap UDP flows; they would time out automatically once
> > > one side ceases to transmit.
> >
> > If I understand correctly this will have some visible effects: before
> > this change the same tuple would always hit the same destination because
> > the conntrack entry prevails, and after this patch the packets will be
> > spread out randomly across the backends.
>
> Nothing changes unless the flow is 100% uni-directional (no
> return traffic, OR only reply traffic after the initial packet).
>
> Old behavior:
>
> Packet  Direction
> 1       Origin  # create entry with 30s timeout, save current time
> 2       Reply   # flag entry assured (cannot be fast-removed
>                 # if conntrack table is full), refresh timeout to 30s
> 3       Origin  # refresh timeout to 30s, OR to 2 minutes if the saved time is 2s in the past
> 4       Origin  # refresh timeout to 30s, OR to 2 minutes if the saved time is 2s in the past
> ...
> 5       Origin  # refresh timeout to 30s, OR to 2 minutes if the saved time is 2s in the past
>
> Thus, once the DNAT mapping changes, the conntrack entry remains alive for
> as long as origin keeps sending more packets.
>
> After the change:
>
> Packet  Direction
> 1       Origin  # create entry with 30s timeout, save current time
> 2       Reply   # flag entry assured (cannot be fast-removed
>                 # if conntrack table is full), refresh timeout to 30s
> 3       Origin  # do not refresh timeout
> 4       Origin  # do not refresh timeout
> ...
> 5       Origin  # do not refresh timeout
>
> Once 30s have elapsed, this entry will expire, even if packets are still
> being seen. The next packet will create a new conntrack entry, and a new
> NAT mapping could be chosen.
>
> If there is reply traffic, nothing changes:
>
> Packet  Direction
> 1       Origin  # create entry with 30s timeout, save current time
> 2       Reply   # flag entry assured (cannot be fast-removed
>                 # if conntrack table is full), refresh timeout to 30s
> 3       Origin  # do not refresh timeout
> 4       Reply   # refresh timeout to 30s, OR to 2 minutes if the saved time is 2s in the past
> ...
> 5       Origin  # do not refresh timeout
>
> That's the general idea anyway.
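>
> Something like this completely untested sketch (the per-direction
> last-packet storage is invented here for illustration only; a real
> patch would need to find room for it in the conntrack entry):
>
> static void nf_ct_refresh_udp_sketch(struct nf_conn *ct,
> 				     enum ip_conntrack_dir dir,
> 				     u32 *last_pkt, /* hypothetical [IP_CT_DIR_MAX] store */
> 				     u32 extra_jiffies)
> {
> 	u32 now = nfct_time_stamp;
> 	enum ip_conntrack_dir other = dir == IP_CT_DIR_ORIGINAL ?
> 				      IP_CT_DIR_REPLY : IP_CT_DIR_ORIGINAL;
>
> 	last_pkt[dir] = now;
>
> 	/* If the other side has been quiet for longer than the timeout
> 	 * we would grant, stop extending the entry's lifetime: it then
> 	 * expires on its own and the next packet creates a fresh entry
> 	 * (and a fresh NAT mapping).
> 	 */
> 	if (nf_ct_is_confirmed(ct) && now - last_pkt[other] > extra_jiffies)
> 		return;
>
> 	WRITE_ONCE(ct->timeout, now + extra_jiffies);
> }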
> I'll try to make this more robust and also let it time out if we have
> only reply traffic without any origin packets.
>
> > I'm aware of some requests to have this behavior in kubernetes for UDP
> > services, but I'm scared that there can be some people relying on the
> > current behavior ... and of all the possible edge cases this will open:
> > what happens if we have two UDP packets in one direction before the
> > first reply is seen?
>
> Same as what happens now: the 2nd packet follows the NAT mapping of the
> first one.

This looks like the way to go ... if you can send me a patch I can do some
testing next week and report back.
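
For the testing I'd watch events with something like "conntrack -E -p udp"
and generate the unidirectional stream with a toy sender along these lines
(loopback address and port as in your trace; no error handling):

	/* send one datagram per second to 127.0.0.1:12345,
	 * never read any replies
	 */
	#include <arpa/inet.h>
	#include <netinet/in.h>
	#include <string.h>
	#include <sys/socket.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = socket(AF_INET, SOCK_DGRAM, 0);
		struct sockaddr_in dst;

		memset(&dst, 0, sizeof(dst));
		dst.sin_family = AF_INET;
		dst.sin_port = htons(12345);
		dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

		for (;;) {
			sendto(fd, "x", 1, 0, (struct sockaddr *)&dst, sizeof(dst));
			sleep(1);
		}
		return 0;
	}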