Re: [PATCH nf] netfilter: nf_tables: do not defer rule destruction via call_rcu

Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> · Wed, 11 Dec 2024 17:16:11 +0100

Hi Florian

On Mon, Dec 09, 2024 at 11:04:01PM +0100, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> wrote:
> > > One way would be to allocate nft_trans_rule objects + one nft_trans_chain
> > > object, deactivate the rules + the chain and then defer the freeing to the
> > > nft destroy workqueue.  We'd still need to keep the synchronize_rcu path as
> > > a fallback to handle -ENOMEM corner cases though.
> > 
> > I think it can be done _without_ nft_trans objects.
> > 
> > Since the commit mutex is held in this netdev event path: Remove this
> > basechain, deactivate rules and add basechain to global list protected
> > with spinlock, it invokes worker. Then, worker zaps this list
> > basechains, it calls synchronize_rcu() and it destroys rules and then
> > basechain. No memory allocations needed in this case?
> 
> I don't think its possible due to netlink dumps.
> 
> Its safe for normal commit path because list_del_rcu() gets called on
> these objects (making them unreachable for new dumps), then,
> synchronize_rcu is called (so all concurrent dumpers are done) and then
> we free.
> 
> It would work if you add a new list_head to the basechain, then you
> could just link the basechain object to some other list and then
> call nf_tables_rule_release() AFTER a synchronize_rcu because rules
> can't be picked up if the owning chain is no longer reachable.
> 
> However, I do wonder how complex it would be.

I started a patch, spent several hours in a row, but I need more time
to get this right.

There is also Phil making the line to remove this in nf-next.

> This is tricky to get right and I'm not sure this patch adds a
> noticeable issue, see nft_rcv_nl_event() which has similar pattern,
> i.e. synchronize_rcu() is called.

I can see veth hits synchronize_rcu(), this can be called from
default_device_exit_batch() which calls unregister_netdevice_many().

I need time, but it is difficult. We have to deliver a pull request
very late on wednesday so it can be taken on thursday.

Now the week has three days to fix and test stuff, so upstream
maintainers have a day to tidy up and submit upstream.

> I'm not saying we should not consider async route, but maybe better
> to followup in -next?

Yes, I can follow up in -next, but then Phil just remove this again in
-next.

I can just take this fix to address the existing situation which is
worse than before. Before this patch UAF was possible, although I
never manage to trigger it. Now crash is easy to trigger, I spent too
much time looking at the interaction with layers and I forgot basic
assumption regarding nf_tables in this patch.