Re: [PATCH bpf-next v2 0/8] Support defragmenting IPv(4|6) packets in BPF

Daniel Xu <dxu@xxxxxxxxx> · Tue, 28 Feb 2023 16:17:16 -0700

Hi Alexei,

On Mon, Feb 27, 2023 at 08:56:38PM -0800, Alexei Starovoitov wrote:
> On Mon, Feb 27, 2023 at 5:57 PM Daniel Xu <dxu@xxxxxxxxx> wrote:
> >
> > Hi Alexei,
> >
> > On Mon, Feb 27, 2023 at 03:03:38PM -0800, Alexei Starovoitov wrote:
> > > On Mon, Feb 27, 2023 at 12:51:02PM -0700, Daniel Xu wrote:
> > > > === Context ===
> > > >
> > > > In the context of a middlebox, fragmented packets are tricky to handle.
> > > > The full 5-tuple of a packet is often only available in the first
> > > > fragment which makes enforcing consistent policy difficult. There are
> > > > really only two stateless options, neither of which are very nice:
> > > >
> > > > 1. Enforce policy on first fragment and accept all subsequent fragments.
> > > >    This works but may let in certain attacks or allow data exfiltration.
> > > >
> > > > 2. Enforce policy on first fragment and drop all subsequent fragments.
> > > >    This does not really work b/c some protocols may rely on
> > > >    fragmentation. For example, DNS may rely on oversized UDP packets for
> > > >    large responses.
> > > >
> > > > So stateful tracking is the only sane option. RFC 8900 [0] calls this
> > > > out as well in section 6.3:
> > > >
> > > >     Middleboxes [...] should process IP fragments in a manner that is
> > > >     consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
> > > >     must maintain state in order to achieve this goal.
> > > >
> > > > === BPF related bits ===
> > > >
> > > > However, when policy is enforced through BPF, the prog is run before the
> > > > kernel reassembles fragmented packets. This leaves BPF developers in
> > > > an awkward place: implement reassembly (possibly poorly) or use a stateless
> > > > method as described above.
> > > >
> > > > Fortunately, the kernel has robust support for fragmented IP packets.
> > > > This patchset wraps the existing defragmentation facilities in kfuncs so
> > > > that BPF progs running on middleboxes can reassemble fragmented packets
> > > > before applying policy.
> > > >
> > > > === Patchset details ===
> > > >
> > > > This patchset is (hopefully) relatively straightforward from BPF perspective.
> > > > One thing I'd like to call out is the skb_copy()ing of the prog skb. I
> > > > did this to maintain the invariant that the ctx remains valid after prog
> > > > has run. This is relevant b/c ip_defrag() and ip_check_defrag() may
> > > > consume the skb if the skb is a fragment.
> > >
> > > Instead of doing all that with extra skb copy can you hook bpf prog after
> > > the networking stack already handled ip defrag?
> > > What kind of middle box are you doing? Why does it have to run at TC layer?
> >
> > Unless I'm missing something, the only other relevant hooks would be
> > socket hooks, right?
> >
> > Unfortunately I don't think my use case can do that. We are running the
> > kernel as a router, so no sockets are involved.
> 
> Are you using bpf_fib_lookup and populating kernel routing
> table and doing everything on your own including neigh ?

We're currently not doing any routing things in BPF yet. All the routing
manipulation has been done in iptables / netfilter so far. I'm not super
familiar with routing stuff but from what I understand there is some
relatively complicated stuff going on with BGP and ipsec tunnels at the
moment. Not sure if that answers your question.

> Have you considered to skb redirect to another netdev that does ip defrag?
> Like macvlan does it under some conditions. This can be generalized.

I had not considered that yet. Are you suggesting adding a new
passthrough netdev that defrags? I looked at the macvlan driver
and it looks like it defrags to handle some multicast corner case.

> Recently Florian proposed to allow calling bpf progs from all existing
> netfilter hooks.
> You can pretend to local deliver and hook in NF_INET_LOCAL_IN ?

Does that work for forwarding cases? I'm reading through [0] and it
seems to suggest that it'll only defrag for locally destined packets:

    If the destination IP address is matches with
    local NIC's IP address, the dst_input() function will brings the packets
    into the ip_local_deliver(), which will defrag the packet and pass it
    to the NF_IP_LOCAL_IN hook

Faking local delivery seems kinda ugly -- though maybe I just don't know
of a clean way to do it.
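
For anyone following along, here is a rough userspace sketch of the check
that makes the stateless options in the cover letter so limited: only the
fragment at offset 0 can carry the L4 header, so a prog at TC cannot see
ports on later fragments. The SKETCH_* names and both helpers are
hypothetical and only for illustration; the mask values mirror IP_MF and
IP_OFFSET from the kernel's <net/ip.h>, and the first helper follows the
same logic as the kernel's ip_is_fragment(). It is not code from this
patchset.

```c
#include <stdbool.h>
#include <stdint.h>
#include <arpa/inet.h>  /* ntohs(), htons() */

/*
 * Bits of the IPv4 frag_off field (RFC 791). The kernel calls these
 * IP_MF and IP_OFFSET; they are redefined here under hypothetical
 * names so the sketch builds in userspace.
 */
#define SKETCH_IP_MF     0x2000  /* "more fragments" flag */
#define SKETCH_IP_OFFSET 0x1FFF  /* fragment offset, in 8-byte units */

/*
 * True if the packet is any fragment (first, middle, or last).
 * frag_off_be is the frag_off field as read from the header, i.e. in
 * network byte order.
 */
static bool sketch_is_fragment(uint16_t frag_off_be)
{
	return (ntohs(frag_off_be) & (SKETCH_IP_MF | SKETCH_IP_OFFSET)) != 0;
}

/*
 * True if the packet starts at offset 0, i.e. it is either unfragmented
 * or the first fragment. Only these packets can carry the L4 header, so
 * only they can expose the full 5-tuple.
 */
static bool sketch_starts_at_l4_header(uint16_t frag_off_be)
{
	return (ntohs(frag_off_be) & SKETCH_IP_OFFSET) == 0;
}
```

A stateless policy can only match ports when sketch_starts_at_l4_header()
is true; for every other fragment it has to either accept or drop blindly,
which is exactly the two options listed in the cover letter.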

[...]

[0]: https://kernelnewbies.org/Networking?action=AttachFile&do=get&target=hacking_the_wholism_of_linux_net.txt

Thanks,
Daniel