Hi Alexei,

(cc netfilter maintainers)

On Mon, Mar 06, 2023 at 08:17:20PM -0800, Alexei Starovoitov wrote:
> On Tue, Feb 28, 2023 at 3:17 PM Daniel Xu <dxu@xxxxxxxxx> wrote:
> >
> > > Have you considered to skb redirect to another netdev that does ip defrag?
> > > Like macvlan does it under some conditions. This can be generalized.
> >
> > I had not considered that yet. Are you suggesting adding a new
> > passthrough netdev thing that'll defrags? I looked at the macvlan driver
> > and it looks like it defrags to handle some multicast corner case.
>
> Something like that. A netdev that bpf prog can redirect too.
> It will consume ip frags and eventually will produce reassembled skb.
>
> The kernel ip_defrag logic has timeouts, counters, rhashtable
> with thresholds, etc. All of them are per netns.
> Just another ip_defrag_user will still share rhashtable
> with its limits. The kernel can even do icmp_send().
> ip_defrag is not a kfunc. It's a big block with plenty of kernel
> wide side effects.
> I really don't think we can alloc_skb, copy_skb, and ip_defrag it.
> It messes with the stack too much.
> It's also not clear to me when skb is reassembled and how bpf sees it.
> "redirect into reassembling netdev" and attaching bpf prog to consume
> that skb is much cleaner imo.
> May be there are other ways to use ip_defrag, but certainly not like
> synchronous api helper.

I was giving the virtual netdev idea some thought this morning and I
thought I'd give the netfilter approach a deeper look.

From my reading (I'll run some tests later) it looks like netfilter
will defrag all ipv4/ipv6 packets in any netns with conntrack enabled.
It appears to do so in NF_INET_PRE_ROUTING.

Unfortunately that does run after tc hooks. But fortunately with the
new BPF netfilter hooks I think we can make defrag work outside of BPF
kfuncs like you want. And the NF_IP_FORWARD hook works well for my
router use case.

One thing we would need though are (probably kfunc) wrappers around
nf_defrag_ipv4_enable() and nf_defrag_ipv6_enable() to ensure BPF progs
are not transitively depending on defrag support from other netfilter
modules.

The exact mechanism would probably need some thinking, as the above
functions kinda rely on module_init() and module_exit() semantics. We
cannot make the prog bump the refcnt every time it runs -- it would
overflow. And it would be nice to automatically free the refcnt when
prog is unloaded.

Once the netfilter prog type series lands I can get that discussion
started. Unless Daniel feels strongly that we should continue with the
approach in this patchset, I am leaning towards dropping in favor of
netfilter approach.

Thanks,
Daniel