Re: [PATCH] netfilter: conntrack: fix wrong ct->timeout value

Tzung-Bi Shih <tzungbi@xxxxxxxxxx> · Mon, 17 Apr 2023 11:41:41 +0800

On Fri, Apr 14, 2023 at 10:12:14AM +0200, Pablo Neira Ayuso wrote:
> On Fri, Apr 14, 2023 at 11:52:02AM +0800, Tzung-Bi Shih wrote:
> > On Thu, Apr 13, 2023 at 11:04:24AM +0200, Pablo Neira Ayuso wrote:
> > > On Thu, Apr 13, 2023 at 11:23:11AM +0800, Tzung-Bi Shih wrote:
> > > > nf_ct_expires() got called by userspace program.  And the return
> > > > value (which means the remaining timeout) will be the parameter for
> > > > the next ctnetlink_change_timeout().
> > > 
> > > Unconfirmed conntrack is owned by the packet that refers to it, it is
> > > not yet in the hashes. I don't see how concurrent access to the
> > > timeout might occur.
> > > 
> > > Or are you referring to a different scenario that triggers the partial
> > > state?
> > 
> > I think it isn't a concurrent access.  We observed that userspace program
> > reads the remaining timeout and sets it back for unconfirmed conntrack.
> > 
> > Conceptually, it looks like:
> > timeout = nf_ct_expires(...);  (1)
> > ctnetlink_change_timeout(...timeout);
> 
> How could this possibly happen?
> 
> nf_ct_expires() is called from:
> 
> - ctnetlink_dump_timeout(), this is netlink dump path, since:
> 
> commit 1397af5bfd7d32b0cf2adb70a78c9a9e8f11d912
> Author: Florian Westphal <fw@xxxxxxxxx>
> Date:   Mon Apr 11 13:01:18 2022 +0200
> 
>     netfilter: conntrack: remove the percpu dying list
> 
> it is not possible to do any listing of unconfirmed, and that is fine.
> 
> As I said, nfnetlink_queue operates with an unconfirmed conntrack with
> owns exclusively, which is not in the hashes.

I applied the following patches on top of v6.3-rc5[5].

As you suggested:

diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h
index 71d1269fe4d4..3384859a8921 100644
--- a/include/net/netfilter/nf_conntrack_core.h
+++ b/include/net/netfilter/nf_conntrack_core.h
@@ -89,7 +89,11 @@ static inline void __nf_ct_set_timeout(struct nf_conn *ct, u64 timeout)
 {
        if (timeout > INT_MAX)
                timeout = INT_MAX;
-       WRITE_ONCE(ct->timeout, nfct_time_stamp + (u32)timeout);
+
+       if (nf_ct_is_confirmed(ct))
+               WRITE_ONCE(ct->timeout, nfct_time_stamp + (u32)timeout);
+       else
+               ct->timeout = (u32)timeout;
 }
 
And this patch for dumping the debug information:

diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index bfc3aaa2c872..b4c016ff07e9 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -178,6 +178,11 @@ static int ctnetlink_dump_timeout(struct sk_buff *skb, const struct nf_conn *ct,
 {
        long timeout = nf_ct_expires(ct) / HZ;
 
+       if (!nf_ct_is_confirmed(ct)) {
+               dump_stack();
+               printk(KERN_ERR "===ct->timeout=%u===timeout=%ld\n", ct->timeout, timeout);
+       }
+
        if (skip_zero && timeout == 0)
                return 0;

And here is the trimmed dmesg:

===ct->timeout=30000===timeout=0
[...] 6.3.0-rc5-00675-ged0c7cbf5b2c [...]
Call Trace:
 <TASK>
 dump_stack_lvl
 ctnetlink_dump_timeout
 __ctnetlink_glue_build
 ctnetlink_glue_build
 __nfqnl_enqueue_packet
 nf_queue
 nf_hook_slow
 ip_mc_output
 ? __pfx_ip_finish_output
 ip_send_skb
 ? __pfx_dst_output
 udp_send_skb
 udp_sendmsg
 ? __pfx_ip_generic_getfrag
 sock_sendmsg

As the dmesg showed, the unconfirmed conntrack did call into
ctnetlink_dump_timeout().  As a result, the value returned from
nf_ct_expires() is always 0 in the case (because the `ct->timeout` got
subtracted by current timestamp `nfct_time_stamp`[4]).

[4]: https://elixir.bootlin.com/linux/v6.3-rc6/source/include/net/netfilter/nf_conntrack.h#L296
[5]: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/refs/heads/merge/continuous/chromeos-kernelupstream-6.3-rc5