Hi Alexei,

On Mon, Feb 27, 2023 at 08:56:38PM -0800, Alexei Starovoitov wrote:
> On Mon, Feb 27, 2023 at 5:57 PM Daniel Xu <dxu@xxxxxxxxx> wrote:
> >
> > Hi Alexei,
> >
> > On Mon, Feb 27, 2023 at 03:03:38PM -0800, Alexei Starovoitov wrote:
> > > On Mon, Feb 27, 2023 at 12:51:02PM -0700, Daniel Xu wrote:
> > > > === Context ===
> > > >
> > > > In the context of a middlebox, fragmented packets are tricky to handle.
> > > > The full 5-tuple of a packet is often only available in the first
> > > > fragment which makes enforcing consistent policy difficult. There are
> > > > really only two stateless options, neither of which are very nice:
> > > >
> > > > 1. Enforce policy on first fragment and accept all subsequent fragments.
> > > >    This works but may let in certain attacks or allow data exfiltration.
> > > >
> > > > 2. Enforce policy on first fragment and drop all subsequent fragments.
> > > >    This does not really work b/c some protocols may rely on
> > > >    fragmentation. For example, DNS may rely on oversized UDP packets for
> > > >    large responses.
> > > >
> > > > So stateful tracking is the only sane option. RFC 8900 [0] calls this
> > > > out as well in section 6.3:
> > > >
> > > >     Middleboxes [...] should process IP fragments in a manner that is
> > > >     consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
> > > >     must maintain state in order to achieve this goal.
> > > >
> > > > === BPF related bits ===
> > > >
> > > > However, when policy is enforced through BPF, the prog is run before the
> > > > kernel reassembles fragmented packets. This leaves BPF developers in a
> > > > awkward place: implement reassembly (possibly poorly) or use a stateless
> > > > method as described above.
> > > >
> > > > Fortunately, the kernel has robust support for fragmented IP packets.
> > > > This patchset wraps the existing defragmentation facilities in kfuncs so
> > > > that BPF progs running on middleboxes can reassemble fragmented packets
> > > > before applying policy.
> > > >
> > > > === Patchset details ===
> > > >
> > > > This patchset is (hopefully) relatively straightforward from BPF perspective.
> > > > One thing I'd like to call out is the skb_copy()ing of the prog skb. I
> > > > did this to maintain the invariant that the ctx remains valid after prog
> > > > has run. This is relevant b/c ip_defrag() and ip_check_defrag() may
> > > > consume the skb if the skb is a fragment.
> > >
> > > Instead of doing all that with extra skb copy can you hook bpf prog after
> > > the networking stack already handled ip defrag?
> > > What kind of middle box are you doing? Why does it have to run at TC layer?
> >
> > Unless I'm missing something, the only other relevant hooks would be
> > socket hooks, right?
> >
> > Unfortunately I don't think my use case can do that. We are running the
> > kernel as a router, so no sockets are involved.
>
> Are you using bpf_fib_lookup and populating kernel routing
> table and doing everything on your own including neigh ?

We're not doing any routing in BPF yet. All the routing manipulation
has been done in iptables / netfilter so far. I'm not super familiar
with the routing side, but from what I understand there is some
relatively complicated stuff going on with BGP and ipsec tunnels at the
moment. Not sure if that answers your question.

> Have you considered to skb redirect to another netdev that does ip defrag?
> Like macvlan does it under some conditions. This can be generalized.

I had not considered that yet. Are you suggesting adding a new
passthrough netdev that defrags? I looked at the macvlan driver and it
looks like it defrags to handle some multicast corner case.

> Recently Florian proposed to allow calling bpf progs from all existing
> netfilter hooks.
> You can pretend to local deliver and hook in NF_INET_LOCAL_IN ?

Does that work for forwarding cases? I'm reading through [0] and it
seems to suggest that it'll only defrag for locally destined packets:

    If the destination IP address is matches with local NIC's IP
    address, the dst_input() function will brings the packets into the
    ip_local_deliver(), which will defrag the packet and pass it to the
    NF_IP_LOCAL_IN hook

Faking local delivery seems kinda ugly -- maybe I don't know any clean
ways.

[...]

[0]: https://kernelnewbies.org/Networking?action=AttachFile&do=get&target=hacking_the_wholism_of_linux_net.txt

Thanks,
Daniel