Re: [netfilter-core] [Q] The usage of xt_recseq.

Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx> · Tue, 13 Aug 2024 17:28:10 +0200

On 2024-08-13 16:37:19 [+0200], Florian Westphal wrote:
> Hi Sebastian!
Hi Florian,

> > While trying to trigger the writer side, I managed only to trigger a
> > single reader and only while using iptables-legacy/ arptables-legacy
> > commands. The nft did not trigger it. So it is legacy code only.
> 
> Yes, this is legacy only.

perfect.

> > Would it work to convert the counters to u64_stats_sync? On 32bit
> > there would be a seqcount_t with preemption disabling during the
> > update which means the xt_write_recseq_begin()/ xt_write_recseq_end()
> > has to be limited the counter update only. On 64bit architectures there
> > would be just the update. This means that number of packets and bytes
> > might be "off" (the one got updated, the other not "yet") but I don't
> > think that this is a problem here.
> 
> Unfortunately its not only about counters; local_bh_disable() is also
> used to prevent messing up the chain jump stack.

Okay. But I could get rid of the counters/ seqcount and worry about the
other things later on?

> For local hooks, this is called from process context, so in order
> to avoid timers kicking in and then re-using the jumpstack, this
> local_bh_disable avoids that.
> 
> The chain stack is percpu in -legacy, and on-stack in nf_tables.
> 
> Then, there is also recursion via xt_TEE.c, hence this strange
>         if (static_key_false(&xt_tee_enabled))
> 
> in ipt_do_table() (We'll switch to a shadow-stack for that case).

Yeah. That is another per-CPU thingy. 
So my plan was to replace the seqcount/ counters with u64_stats_sync,
make this per-CPU nf_skb_duplicated per-TASK and then come back and
check what local_bh_disable() is actually protecting other than skb
alloc/ free during callback invocation (as in TEE).

So jumpstack. This is exclusively used by ipt_do_table(). Not sure how a
timer comes here but I goes any softirq (as in NAPI) would do the job
without actually disabling BH.
Can this be easily transformed to the on-stack thingy that nft is using
or is it completely different? If the latter I would just a add a lock. 
local_lock_nested_bh() would be the easiest to not upset anyone but this
is using hand crafted per-CPU memory instead of alloc_percpu(). Can
stacksize get extremely huge?

In case I made a wrong turn somewhere, do you have better suggestions?

Sebastian