Re: [PATCH net v2] netfilter: nf_flow_table: fix teardown flow timeout

Sven Auhagen <sven.auhagen@xxxxxxxxxxxx> · Mon, 16 May 2022 20:23:10 +0200

On Mon, May 16, 2022 at 07:50:09PM +0200, Pablo Neira Ayuso wrote:
> On Mon, May 16, 2022 at 03:02:13PM +0200, Sven Auhagen wrote:
> > On Mon, May 16, 2022 at 02:43:06PM +0200, Pablo Neira Ayuso wrote:
> > > On Mon, May 16, 2022 at 02:23:00PM +0200, Sven Auhagen wrote:
> > > > On Mon, May 16, 2022 at 02:13:03PM +0200, Pablo Neira Ayuso wrote:
> > > > > On Mon, May 16, 2022 at 12:56:41PM +0200, Pablo Neira Ayuso wrote:
> > > > > > On Thu, May 12, 2022 at 09:28:03PM +0300, Oz Shlomo wrote:
> [...]
> > > > > > [...]
> > > > > > > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> > > > > > > index 0164e5f522e8..324fdb62c08b 100644
> > > > > > > --- a/net/netfilter/nf_conntrack_core.c
> > > > > > > +++ b/net/netfilter/nf_conntrack_core.c
> > > > > > > @@ -1477,7 +1477,8 @@ static void gc_worker(struct work_struct *work)
> > > > > > >  			tmp = nf_ct_tuplehash_to_ctrack(h);
> > > > > > >  
> > > > > > >  			if (test_bit(IPS_OFFLOAD_BIT, &tmp->status)) {
> > > > > > > -				nf_ct_offload_timeout(tmp);
> > > > > > 
> > > > > > Hm, it is the trick to avoid checking for IPS_OFFLOAD from the packet
> > > > > > path that triggers the race, ie. nf_ct_is_expired()
> > > > > > 
> > > > > > The flowtable ct fixup races with conntrack gc collector.
> > > > > > 
> > > > > > Clearing IPS_OFFLOAD might result in offloading the entry again for
> > > > > > the closing packets.
> > > > > > 
> > > > > > Probably clear IPS_OFFLOAD from teardown, and skip offload if flow is
> > > > > > in a TCP state that represent closure?
> > > > > > 
> > > > > >   		if (unlikely(!tcph || tcph->fin || tcph->rst))
> > > > > >   			goto out;
> > > > > > 
> > > > > > this is already the intention in the existing code.
> > > > > 
> > > > > I'm attaching an incomplete sketch patch. My goal is to avoid the
> > > > > extra IPS_ bit.
> > > > 
> > > > You might create a race with ct gc that will remove the ct
> > > > if it is in close or end of close and before flow offload teardown is running
> > > > so flow offload teardown might access memory that was freed.
> > > 
> > > flow object holds a reference to the ct object until it is released,
> > > no use-after-free can happen.
> > > 
> > 
> > Also if nf_ct_delete is called before flowtable delete?
> > Can you let me know why?
> 
> nf_ct_delete() removes the conntrack object from lists and it
> decrements the reference counter by one.
> 
> flow_offload_free() also calls nf_ct_put(). flow_offload_alloc() bumps
> the reference count on the conntrack object before creating the flow.
> 
> > > > It is not a very likely scenario but never the less it might happen now
> > > > since the IPS_OFFLOAD_BIT is not set and the state might just time out.
> > > > 
> > > > If someone sets a very small TCP CLOSE timeout it gets more likely.
> > > > 
> > > > So Oz and myself were debatting about three possible cases/problems:
> > > > 
> > > > 1. ct gc sets timeout even though the state is in CLOSE/FIN because the
> > > > IPS_OFFLOAD is still set but the flow is in teardown
> > > > 2. ct gc removes the ct because the IPS_OFFLOAD is not set and
> > > > the CLOSE timeout is reached before the flow offload del
> > > 
> > > OK.
> > > 
> > > > 3. tcp ct is always set to ESTABLISHED with a very long timeout
> > > > in flow offload teardown/delete even though the state is already
> > > > CLOSED.
> > > >
> > > > Also as a remark we can not assume that the FIN or RST packet is hitting
> > > > flow table teardown as the packet might get bumped to the slow path in
> > > > nftables.
> > > 
> > > I assume this remark is related to 3.?
> > 
> > Yes, exactly.
> > 
> > > if IPS_OFFLOAD is unset, then conntrack would update the state
> > > according to this FIN or RST.
> > 
> > It will move to a different TCP state anyways only the ct state
> > will be at IPS_OFFLOAD_BIT and prevent it from beeing garbage collected.
> > The timeout will be bumped back up as long as IPS_OFFLOAD_BIT is set
> > even though TCP might already be CLOSED.

I see what you are trying to do here, I have some remarks:

> 
> If teardown fixes the ct state and timeout to established, and IPS_OFFLOAD is
> unset, then the packet is passed up in a consistent state.
> 
> I made a patch, it is based on yours, it's attached:
> 
> - If flow timeout expires or rst/fin is seen, ct state and timeout is
>   fixed up (to established state) and IPS_OFFLOAD is unset.
> 
> - If rst/fin packet is seen, ct state and timeout is fixed up (to
>   established state) and IPS_OFFLOAD is unset. The packet continues
>   its travel up to the classic path, so conntrack triggers the
>   transition from established to one of the close states.
> 
> For the case 1., IPS_OFFLOAD is not set anymore, so conntrack gc
> cannot race to reset the ct timeout anymore.
> 
> For the case 2., if gc conntrack ever removes the ct entry, then the
> IPS_DYING bit is set, which implicitly triggers the teardown state
> from the flowtable gc. The flowtable still holds a reference to the
> ct object, so no UAF can happen.
> 
> For the case 3. the conntrack is set to ESTABLISHED with a long
> timeout, yes. This is to deal with the two possible cases:
> 
> a) flowtable timeout expired, so conntrack recovers control on the
>    flow.
> b) tcp rst/fin will take back the packet to slow path. The ct has been
>    fixed up to established state so it will trasition to one of the
>    close states.
> 
> Am I missing anything?

You should not fixup the tcp state back to established.
If flow_offload_teardown is not called because a packet got bumped up to the slow path
and you call flow_offload_teardown from nf_flow_offload_gc_step, the tcp state might already
be in CLOSE state and you just moved it back to established.
The entire function flow_offload_fixup_tcp can go away if we only allow established tcp states
in the flowtable.

Same goes for the timeout. The timeout should really be set to the current tcp state
ct->proto.tcp->state which might not be established anymore.

For me the question remains, why can the ct gc not remove the ct when nf_ct_delete
is called before flow_offload_del is called?

Also you probably want to move the IPS_OFFLOAD_BIT to the beginning of
flow_offload_teardown just to make sure that the ct gc is not bumping up the ct timeout
while it is changed in flow_offload_fixup_ct.