On Fri, Apr 14, 2023 at 10:12:14AM +0200, Pablo Neira Ayuso wrote: > On Fri, Apr 14, 2023 at 11:52:02AM +0800, Tzung-Bi Shih wrote: > > On Thu, Apr 13, 2023 at 11:04:24AM +0200, Pablo Neira Ayuso wrote: > > > On Thu, Apr 13, 2023 at 11:23:11AM +0800, Tzung-Bi Shih wrote: > > > > nf_ct_expires() got called by userspace program. And the return > > > > value (which means the remaining timeout) will be the parameter for > > > > the next ctnetlink_change_timeout(). > > > > > > Unconfirmed conntrack is owned by the packet that refers to it, it is > > > not yet in the hashes. I don't see how concurrent access to the > > > timeout might occur. > > > > > > Or are you referring to a different scenario that triggers the partial > > > state? > > > > I think it isn't a concurrent access. We observed that userspace program > > reads the remaining timeout and sets it back for unconfirmed conntrack. > > > > Conceptually, it looks like: > > timeout = nf_ct_expires(...); (1) > > ctnetlink_change_timeout(...timeout); > > How could this possibly happen? > > nf_ct_expires() is called from: > > - ctnetlink_dump_timeout(), this is netlink dump path, since: > > commit 1397af5bfd7d32b0cf2adb70a78c9a9e8f11d912 > Author: Florian Westphal <fw@xxxxxxxxx> > Date: Mon Apr 11 13:01:18 2022 +0200 > > netfilter: conntrack: remove the percpu dying list > > it is not possible to do any listing of unconfirmed, and that is fine. > > As I said, nfnetlink_queue operates with an unconfirmed conntrack with > owns exclusively, which is not in the hashes. I applied the following patches on top of v6.3-rc5[5]. As you suggested: diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h index 71d1269fe4d4..3384859a8921 100644 --- a/include/net/netfilter/nf_conntrack_core.h +++ b/include/net/netfilter/nf_conntrack_core.h @@ -89,7 +89,11 @@ static inline void __nf_ct_set_timeout(struct nf_conn *ct, u64 timeout) { if (timeout > INT_MAX) timeout = INT_MAX; - WRITE_ONCE(ct->timeout, nfct_time_stamp + (u32)timeout); + + if (nf_ct_is_confirmed(ct)) + WRITE_ONCE(ct->timeout, nfct_time_stamp + (u32)timeout); + else + ct->timeout = (u32)timeout; } And this patch for dumping the debug information: diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index bfc3aaa2c872..b4c016ff07e9 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -178,6 +178,11 @@ static int ctnetlink_dump_timeout(struct sk_buff *skb, const struct nf_conn *ct, { long timeout = nf_ct_expires(ct) / HZ; + if (!nf_ct_is_confirmed(ct)) { + dump_stack(); + printk(KERN_ERR "===ct->timeout=%u===timeout=%ld\n", ct->timeout, timeout); + } + if (skip_zero && timeout == 0) return 0; And here is the trimmed dmesg: ===ct->timeout=30000===timeout=0 [...] 6.3.0-rc5-00675-ged0c7cbf5b2c [...] Call Trace: <TASK> dump_stack_lvl ctnetlink_dump_timeout __ctnetlink_glue_build ctnetlink_glue_build __nfqnl_enqueue_packet nf_queue nf_hook_slow ip_mc_output ? __pfx_ip_finish_output ip_send_skb ? __pfx_dst_output udp_send_skb udp_sendmsg ? __pfx_ip_generic_getfrag sock_sendmsg As the dmesg showed, the unconfirmed conntrack did call into ctnetlink_dump_timeout(). As a result, the value returned from nf_ct_expires() is always 0 in the case (because the `ct->timeout` got subtracted by current timestamp `nfct_time_stamp`[4]). [4]: https://elixir.bootlin.com/linux/v6.3-rc6/source/include/net/netfilter/nf_conntrack.h#L296 [5]: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/refs/heads/merge/continuous/chromeos-kernelupstream-6.3-rc5