On Fri, 18 Oct 2024 at 13:33, Florian Westphal <fw@xxxxxxxxx> wrote:
>
> Antonio Ojea <antonio.ojea.garcia@xxxxxxxxx> wrote:
> > > The UDP tracker is old and refreshes the timeout for every packet, but
> > > who says it needs to stay that way?
> > >
> > > With a very simple patch to only refresh in the 'reply' dir:
> > >
> > >     [NEW] udp 17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> > > [DESTROY] udp 17 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> > >     [NEW] udp 17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> > > [DESTROY] udp 17 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> > >     [NEW] udp 17 30 src=127.0.0.1 dst=127.0.0.1 sport=49819 dport=12345 [UNREPLIED] src=127.0.0.1 dst=127.0.0.1 sport=12345 dport=49819
> > >
> > > This is with a unidirectional UDP stream.
> > >
> > > diff --git a/net/netfilter/nf_conntrack_proto_udp.c b/net/netfilter/nf_conntrack_proto_udp.c
> > > --- a/net/netfilter/nf_conntrack_proto_udp.c
> > > +++ b/net/netfilter/nf_conntrack_proto_udp.c
> > > @@ -17,6 +17,7 @@
> > >  #include <linux/netfilter.h>
> > >  #include <linux/netfilter_ipv4.h>
> > >  #include <linux/netfilter_ipv6.h>
> > > +#include <net/netfilter/nf_conntrack_acct.h>
> > >  #include <net/netfilter/nf_conntrack_l4proto.h>
> > >  #include <net/netfilter/nf_conntrack_ecache.h>
> > >  #include <net/netfilter/nf_conntrack_timeout.h>
> > > @@ -80,6 +81,35 @@ static bool udp_error(struct sk_buff *skb,
> > >  	return false;
> > >  }
> > >
> > > +static void nf_ct_refresh_udp(struct nf_conn *ct,
> > > +			      enum ip_conntrack_dir dir,
> > > +			      u32 len, u32 extra_jiffies)
> > > +{
> > > +	/* Only update if this is not a fixed timeout */
> > > +	if (test_bit(IPS_FIXED_TIMEOUT_BIT, &ct->status))
> > > +		return nf_ct_acct_update(ct, dir, len);
> > > +
> > > +	switch (dir) {
> > > +	case IP_CT_DIR_ORIGINAL:
> > > +		/* do not refresh in original dir, else stale dnat mapping
> > > +		 * can be kept alive indefinitely.
> > > +		 */
> > > +		if (nf_ct_is_confirmed(ct))
> > > +			return nf_ct_acct_update(ct, dir, len);
> > > +		break;
> > > +	case IP_CT_DIR_REPLY:
> > > +		extra_jiffies += nfct_time_stamp;
> > > +		break;
> > > +	case IP_CT_DIR_MAX: /* silence warning wrt. unhandled enum */
> > > +		break;
> > > +	}
> > > +
> > > +	if (READ_ONCE(ct->timeout) != extra_jiffies)
> > > +		WRITE_ONCE(ct->timeout, extra_jiffies);
> > > +
> > > +	nf_ct_acct_update(ct, dir, len);
> > > +}
> > > +
> > >  /* Returns verdict for packet, and may modify conntracktype */
> > >  int nf_conntrack_udp_packet(struct nf_conn *ct,
> > >  			    struct sk_buff *skb,
> > > @@ -114,7 +144,7 @@ int nf_conntrack_udp_packet(struct nf_conn *ct,
> > >  		stream = (status & IPS_ASSURED) == 0;
> > >  	}
> > >
> > > -	nf_ct_refresh_acct(ct, ctinfo, skb, extra);
> > > +	nf_ct_refresh_udp(ct, CTINFO2DIR(ctinfo), skb->len, extra);
> > >
> > >  	/* never set ASSURED for IPS_NAT_CLASH, they time out soon */
> > >  	if (unlikely((status & IPS_NAT_CLASH)))
> > > @@ -124,7 +154,7 @@ int nf_conntrack_udp_packet(struct nf_conn *ct,
> > >  		if (stream && !test_and_set_bit(IPS_ASSURED_BIT, &ct->status))
> > >  			nf_conntrack_event_cache(IPCT_ASSURED, ct);
> > >  	} else {
> > > -		nf_ct_refresh_acct(ct, ctinfo, skb, timeouts[UDP_CT_UNREPLIED]);
> > > +		nf_ct_refresh_udp(ct, CTINFO2DIR(ctinfo), skb->len, timeouts[UDP_CT_UNREPLIED]);
> > >  	}
> > >  	return NF_ACCEPT;
> > >  }
> > >
> > > The obvious counter-argument is "but what if it's the reply side that is
> > > sending, and we have a stale SNAT mapping?".
> > >
> > > The solution would be to track the timestamp of the last packet seen per
> > > direction, and then stop refreshing the timeout if one side ceases sending.
> > >
> > > I think that's much better than what we have now and there would be no
> > > need to actively zap UDP flows; they would time out automatically once
> > > one side ceases to transmit.
> >
> > If I understand correctly this will have some visible effects: before
> > this change the same tuple would always hit the same destination because
> > the conntrack entry prevails, and after this patch the packets will be
> > spread out randomly across the backends.
>
> Nothing changes unless the flow is 100% uni-directional (no
> return traffic, OR only reply traffic after the initial packet).
>
> Old behavior:
>
> Packet  Direction
> 1       Origin  # create entry with 30s timeout, save current time
> 2       Reply   # flag entry assured (cannot be fast-removed
>                 # if conntrack table is full), refresh timeout to 30s
> 3       Origin  # refresh timeout to 30s, OR to 2 minutes if the saved time is 2s in the past
> 4       Origin  # refresh timeout to 30s, OR to 2 minutes if the saved time is 2s in the past
> ...
> 5       Origin  # refresh timeout to 30s, OR to 2 minutes if the saved time is 2s in the past
>
> Thus, once the DNAT mapping changes, the conntrack entry remains alive for
> as long as origin keeps sending more packets.
>
> After the change:
>
> Packet  Direction
> 1       Origin  # create entry with 30s timeout, save current time
> 2       Reply   # flag entry assured (cannot be fast-removed
>                 # if conntrack table is full), refresh timeout to 30s
> 3       Origin  # do not refresh timeout
> 4       Origin  # do not refresh timeout
> ...
> 5       Origin  # do not refresh timeout
>
> Once 30s have elapsed, this entry will expire, even if packets are still
> being seen. The next packet will create a new conntrack entry, and a new
> NAT mapping could be chosen.
>
> If there is reply traffic, nothing changes:
>
> Packet  Direction
> 1       Origin  # create entry with 30s timeout, save current time
> 2       Reply   # flag entry assured (cannot be fast-removed
>                 # if conntrack table is full), refresh timeout to 30s
> 3       Origin  # do not refresh timeout
> 4       Reply   # refresh timeout to 30s, OR to 2 minutes if the saved time is 2s in the past
> ...
> 5       Origin  # do not refresh timeout
>
> That's the general idea anyway.
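>
> Something like this completely untested sketch (the per-direction
> last-packet storage is invented here for illustration only; a real
> patch would need to find room for it in the conntrack entry):
>
> static void nf_ct_refresh_udp_sketch(struct nf_conn *ct,
> 				     enum ip_conntrack_dir dir,
> 				     u32 *last_pkt, /* hypothetical [IP_CT_DIR_MAX] store */
> 				     u32 extra_jiffies)
> {
> 	u32 now = nfct_time_stamp;
> 	enum ip_conntrack_dir other = dir == IP_CT_DIR_ORIGINAL ?
> 				      IP_CT_DIR_REPLY : IP_CT_DIR_ORIGINAL;
>
> 	last_pkt[dir] = now;
>
> 	/* If the other side has been quiet for longer than the timeout
> 	 * we would grant, stop extending the entry's lifetime: it then
> 	 * expires on its own and the next packet creates a fresh entry
> 	 * (and a fresh NAT mapping).
> 	 */
> 	if (nf_ct_is_confirmed(ct) && now - last_pkt[other] > extra_jiffies)
> 		return;
>
> 	WRITE_ONCE(ct->timeout, now + extra_jiffies);
> }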
> I'll try to make this more robust and also let it time out if we have
> only reply traffic without any origin packets.
>
> > I'm aware of some requests to have this behavior in kubernetes for UDP
> > services, but I'm scared that there can be some people relying on the
> > current behavior ... and of all the possible edge cases this will open:
> > what happens if we have two UDP packets in one direction before the
> > first reply is seen?
>
> Same as what happens now: the 2nd packet follows the NAT mapping of the
> first one.

This looks like the way to go ... if you can send me a patch I can do some
testing next week and report back.
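
For the testing I'd watch events with something like "conntrack -E -p udp"
and generate the unidirectional stream with a toy sender along these lines
(loopback address and port as in your trace; no error handling):

	/* send one datagram per second to 127.0.0.1:12345,
	 * never read any replies
	 */
	#include <arpa/inet.h>
	#include <netinet/in.h>
	#include <string.h>
	#include <sys/socket.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = socket(AF_INET, SOCK_DGRAM, 0);
		struct sockaddr_in dst;

		memset(&dst, 0, sizeof(dst));
		dst.sin_family = AF_INET;
		dst.sin_port = htons(12345);
		dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

		for (;;) {
			sendto(fd, "x", 1, 0, (struct sockaddr *)&dst, sizeof(dst));
			sleep(1);
		}
		return 0;
	}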