Re: [PATCH net v2] netfilter: nf_flow_table: fix teardown flow timeout

Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> · Tue, 17 May 2022 10:32:03 +0200

On Mon, May 16, 2022 at 08:23:10PM +0200, Sven Auhagen wrote:
> On Mon, May 16, 2022 at 07:50:09PM +0200, Pablo Neira Ayuso wrote:
> > On Mon, May 16, 2022 at 03:02:13PM +0200, Sven Auhagen wrote:
> > > On Mon, May 16, 2022 at 02:43:06PM +0200, Pablo Neira Ayuso wrote:
> > > > On Mon, May 16, 2022 at 02:23:00PM +0200, Sven Auhagen wrote:
> > > > > On Mon, May 16, 2022 at 02:13:03PM +0200, Pablo Neira Ayuso wrote:
> > > > > > On Mon, May 16, 2022 at 12:56:41PM +0200, Pablo Neira Ayuso wrote:
> > > > > > > On Thu, May 12, 2022 at 09:28:03PM +0300, Oz Shlomo wrote:
> > [...]
> > > > > > > [...]
> > > > > > > > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> > > > > > > > index 0164e5f522e8..324fdb62c08b 100644
> > > > > > > > --- a/net/netfilter/nf_conntrack_core.c
> > > > > > > > +++ b/net/netfilter/nf_conntrack_core.c
> > > > > > > > @@ -1477,7 +1477,8 @@ static void gc_worker(struct work_struct *work)
> > > > > > > >  			tmp = nf_ct_tuplehash_to_ctrack(h);
> > > > > > > >
> > > > > > > >  			if (test_bit(IPS_OFFLOAD_BIT, &tmp->status)) {
> > > > > > > > -				nf_ct_offload_timeout(tmp);
> > > > > > >
> > > > > > > Hm, it is the trick to avoid checking for IPS_OFFLOAD from the packet
> > > > > > > path that triggers the race, ie. nf_ct_is_expired()
> > > > > > >
> > > > > > > The flowtable ct fixup races with conntrack gc collector.
> > > > > > >
> > > > > > > Clearing IPS_OFFLOAD might result in offloading the entry again for
> > > > > > > the closing packets.
> > > > > > >
> > > > > > > Probably clear IPS_OFFLOAD from teardown, and skip offload if flow is
> > > > > > > in a TCP state that represent closure?
> > > > > > >
> > > > > > >   		if (unlikely(!tcph || tcph->fin || tcph->rst))
> > > > > > >   			goto out;
> > > > > > >
> > > > > > > this is already the intention in the existing code.
> > > > > >
> > > > > > I'm attaching an incomplete sketch patch. My goal is to avoid the
> > > > > > extra IPS_ bit.
> > > > >
> > > > > You might create a race with ct gc that will remove the ct
> > > > > if it is in close or end of close and before flow offload teardown is running
> > > > > so flow offload teardown might access memory that was freed.
> > > >
> > > > flow object holds a reference to the ct object until it is released,
> > > > no use-after-free can happen.
> > > >
> > >
> > > Also if nf_ct_delete is called before flowtable delete?
> > > Can you let me know why?
> >
> > nf_ct_delete() removes the conntrack object from lists and it
> > decrements the reference counter by one.
> >
> > flow_offload_free() also calls nf_ct_put(). flow_offload_alloc() bumps
> > the reference count on the conntrack object before creating the flow.
> >
> > > > > It is not a very likely scenario but never the less it might happen now
> > > > > since the IPS_OFFLOAD_BIT is not set and the state might just time out.
> > > > >
> > > > > If someone sets a very small TCP CLOSE timeout it gets more likely.
> > > > >
> > > > > So Oz and myself were debatting about three possible cases/problems:
> > > > >
> > > > > 1. ct gc sets timeout even though the state is in CLOSE/FIN because the
> > > > > IPS_OFFLOAD is still set but the flow is in teardown
> > > > > 2. ct gc removes the ct because the IPS_OFFLOAD is not set and
> > > > > the CLOSE timeout is reached before the flow offload del
> > > >
> > > > OK.
> > > >
> > > > > 3. tcp ct is always set to ESTABLISHED with a very long timeout
> > > > > in flow offload teardown/delete even though the state is already
> > > > > CLOSED.
> > > > >
> > > > > Also as a remark we can not assume that the FIN or RST packet is hitting
> > > > > flow table teardown as the packet might get bumped to the slow path in
> > > > > nftables.
> > > >
> > > > I assume this remark is related to 3.?
> > >
> > > Yes, exactly.
> > >
> > > > if IPS_OFFLOAD is unset, then conntrack would update the state
> > > > according to this FIN or RST.
> > >
> > > It will move to a different TCP state anyways only the ct state
> > > will be at IPS_OFFLOAD_BIT and prevent it from beeing garbage collected.
> > > The timeout will be bumped back up as long as IPS_OFFLOAD_BIT is set
> > > even though TCP might already be CLOSED.
>
> I see what you are trying to do here, I have some remarks:
>
> >
> > If teardown fixes the ct state and timeout to established, and IPS_OFFLOAD is
> > unset, then the packet is passed up in a consistent state.
> >
> > I made a patch, it is based on yours, it's attached:
> >
> > - If flow timeout expires or rst/fin is seen, ct state and timeout is
> >   fixed up (to established state) and IPS_OFFLOAD is unset.
> >
> > - If rst/fin packet is seen, ct state and timeout is fixed up (to
> >   established state) and IPS_OFFLOAD is unset. The packet continues
> >   its travel up to the classic path, so conntrack triggers the
> >   transition from established to one of the close states.
> >
> > For the case 1., IPS_OFFLOAD is not set anymore, so conntrack gc
> > cannot race to reset the ct timeout anymore.
> >
> > For the case 2., if gc conntrack ever removes the ct entry, then the
> > IPS_DYING bit is set, which implicitly triggers the teardown state
> > from the flowtable gc. The flowtable still holds a reference to the
> > ct object, so no UAF can happen.
> >
> > For the case 3. the conntrack is set to ESTABLISHED with a long
> > timeout, yes. This is to deal with the two possible cases:
> >
> > a) flowtable timeout expired, so conntrack recovers control on the
> >    flow.
> > b) tcp rst/fin will take back the packet to slow path. The ct has been
> >    fixed up to established state so it will trasition to one of the
> >    close states.
> >
> > Am I missing anything?
>
> You should not fixup the tcp state back to established.
> If flow_offload_teardown is not called because a packet got bumped up to the slow path
> and you call flow_offload_teardown from nf_flow_offload_gc_step, the tcp state might already
> be in CLOSE state and you just moved it back to established.

OK.

> The entire function flow_offload_fixup_tcp can go away if we only allow established tcp states
> in the flowtable.

I'm keeping it, but I remove the reset of the tcp state.

> Same goes for the timeout. The timeout should really be set to the current tcp state
> ct->proto.tcp->state which might not be established anymore.

OK.

> For me the question remains, why can the ct gc not remove the ct when nf_ct_delete
> is called before flow_offload_del is called?

nf_ct_delete() removes indeed the entry from the conntrack table, then
it calls nf_ct_put() which decrements the refcnt. Given that the
flowtable holds a reference to the conntrack object...

 struct flow_offload *flow_offload_alloc(struct nf_conn *ct)
 {
        struct flow_offload *flow;

        if (unlikely(nf_ct_is_dying(ct) ||
            !refcount_inc_not_zero(&ct->ct_general.use)))
                return NULL;

... use-after-free cannot happen. Note that flow_offload_free() calls
nf_ct_put(flow->ct), so at this point the ct object is released.

Is this your concern?

> Also you probably want to move the IPS_OFFLOAD_BIT to the beginning of
> flow_offload_teardown just to make sure that the ct gc is not bumping up the ct timeout
> while it is changed in flow_offload_fixup_ct.

Done.

See patch attached.
>
>

diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index 20b4a14e5d4e..ebdf5332e838 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -179,12 +179,11 @@ EXPORT_SYMBOL_GPL(flow_offload_route_init);
 
 static void flow_offload_fixup_tcp(struct ip_ct_tcp *tcp)
 {
-	tcp->state = TCP_CONNTRACK_ESTABLISHED;
 	tcp->seen[0].td_maxwin = 0;
 	tcp->seen[1].td_maxwin = 0;
 }
 
-static void flow_offload_fixup_ct_timeout(struct nf_conn *ct)
+static void flow_offload_fixup_ct(struct nf_conn *ct)
 {
 	struct net *net = nf_ct_net(ct);
 	int l4num = nf_ct_protonum(ct);
@@ -193,7 +192,9 @@ static void flow_offload_fixup_ct_timeout(struct nf_conn *ct)
 	if (l4num == IPPROTO_TCP) {
 		struct nf_tcp_net *tn = nf_tcp_pernet(net);
 
-		timeout = tn->timeouts[TCP_CONNTRACK_ESTABLISHED];
+		flow_offload_fixup_tcp(&ct->proto.tcp);
+
+		timeout = tn->timeouts[ct->proto.tcp.state];
 		timeout -= tn->offload_timeout;
 	} else if (l4num == IPPROTO_UDP) {
 		struct nf_udp_net *tn = nf_udp_pernet(net);
@@ -211,18 +212,6 @@ static void flow_offload_fixup_ct_timeout(struct nf_conn *ct)
 		WRITE_ONCE(ct->timeout, nfct_time_stamp + timeout);
 }
 
-static void flow_offload_fixup_ct_state(struct nf_conn *ct)
-{
-	if (nf_ct_protonum(ct) == IPPROTO_TCP)
-		flow_offload_fixup_tcp(&ct->proto.tcp);
-}
-
-static void flow_offload_fixup_ct(struct nf_conn *ct)
-{
-	flow_offload_fixup_ct_state(ct);
-	flow_offload_fixup_ct_timeout(ct);
-}
-
 static void flow_offload_route_release(struct flow_offload *flow)
 {
 	nft_flow_dst_release(flow, FLOW_OFFLOAD_DIR_ORIGINAL);
@@ -361,22 +350,14 @@ static void flow_offload_del(struct nf_flowtable *flow_table,
 	rhashtable_remove_fast(&flow_table->rhashtable,
 			       &flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].node,
 			       nf_flow_offload_rhash_params);
-
-	clear_bit(IPS_OFFLOAD_BIT, &flow->ct->status);
-
-	if (nf_flow_has_expired(flow))
-		flow_offload_fixup_ct(flow->ct);
-	else
-		flow_offload_fixup_ct_timeout(flow->ct);
-
 	flow_offload_free(flow);
 }
 
 void flow_offload_teardown(struct flow_offload *flow)
 {
+	clear_bit(IPS_OFFLOAD_BIT, &flow->ct->status);
 	set_bit(NF_FLOW_TEARDOWN, &flow->flags);
-
-	flow_offload_fixup_ct_state(flow->ct);
+	flow_offload_fixup_ct(flow->ct);
 }
 EXPORT_SYMBOL_GPL(flow_offload_teardown);
 
@@ -466,7 +447,7 @@ static void nf_flow_offload_gc_step(struct nf_flowtable *flow_table,
 	if (nf_flow_has_expired(flow) ||
 	    nf_ct_is_dying(flow->ct) ||
 	    nf_flow_has_stale_dst(flow))
-		set_bit(NF_FLOW_TEARDOWN, &flow->flags);
+		flow_offload_teardown(flow);
 
 	if (test_bit(NF_FLOW_TEARDOWN, &flow->flags)) {
 		if (test_bit(NF_FLOW_HW, &flow->flags)) {
diff --git a/net/netfilter/nft_flow_offload.c b/net/netfilter/nft_flow_offload.c
index 187b8cb9a510..06c9f72d16db 100644
--- a/net/netfilter/nft_flow_offload.c
+++ b/net/netfilter/nft_flow_offload.c
@@ -298,7 +298,8 @@ static void nft_flow_offload_eval(const struct nft_expr *expr,
 	case IPPROTO_TCP:
 		tcph = skb_header_pointer(pkt->skb, nft_thoff(pkt),
 					  sizeof(_tcph), &_tcph);
-		if (unlikely(!tcph || tcph->fin || tcph->rst))
+		if (unlikely(!tcph || tcph->fin || tcph->rst ||
+			     nf_conntrack_tcp_established(&ct->proto.tcp)))
 			goto out;
 		break;
 	case IPPROTO_UDP: