Florian Westphal <fw@xxxxxxxxx> writes:

> Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>> > Lookups should be fine.  Insertions are the problem.
>> >
>> > NAT hooks are expected to execute before the insertion into the
>> > conntrack table.
>> >
>> > If you insert before, NAT hooks won't execute, i.e.
>> > rules that use dnat/redirect/masquerade have no effect.
>>
>> Well yes, if you insert the wrong state into the conntrack table, you're
>> going to get wrong behaviour. That's sorta expected, there are lots of
>> things XDP can do to disrupt the packet flow (like just dropping the
>> packets :)).
>
> Sure, but I'm not sure I understand the use case.
>
> Insertion at XDP layer turns off netfilter's NAT capability, so it's
> incompatible with the classic forwarding path.
>
> If that's fine, why do you need to insert into the conntrack table to
> begin with?  The entire infrastructure it's designed for is disabled...

One of the major selling points of XDP is that you can reuse the
existing kernel infrastructure instead of having to roll your own. So
sure, one could implement their own conntrack using BPF maps (as indeed,
e.g., Cilium has done), but why do that when you can take advantage of
the existing one in the kernel? Same reason we have the bpf_fib_lookup()
helper...

>> > I don't think there is anything that stands in the way of replicating
>> > this via XDP.
>>
>> What I want to be able to do is write an XDP program that does the following:
>>
>> 1. Parse the packet header and determine if it's a packet type we know
>>    how to handle. If not, just return XDP_PASS and let the stack deal
>>    with corner cases.
>>
>> 2. If we know how to handle the packet (say, it's TCP or UDP), do a
>>    lookup into conntrack to figure out if there's state for it and we
>>    need to do things like NAT.
>>
>> 3. If we need to NAT, rewrite the packet based on the information we got
>>    back from conntrack.
>
> You could already do that by storing that info in bpf maps. The
> ctnetlink event generated on conntrack insertion contains the NAT
> mapping information, so you could have a userspace daemon that
> intercepts those to update the map.

Sure, but see above.

>> 4. Update the conntrack state to be consistent with the packet, and then
>>    redirect it out the destination interface.
>>
>> I.e., in the common case the packet doesn't go through the stack at all;
>> but we need to make conntrack aware that we processed the packet so the
>> entry doesn't expire (and any state related to the flow gets updated).
>
> In the HW offload case, conntrack is bypassed completely. There is an
> IPS_(HW)_OFFLOAD_BIT that prevents the flow from expiring.

That's comparable in execution semantics (stack is bypassed entirely),
but not in control plane semantics (we look up from XDP instead of
pushing flows down to an offload).

>> Ideally we should also be able to create new state for a flow we haven't
>> seen before.
>
> The way HW offload was intended to work is to allow users to express
> what flows should be offloaded via 'flow add' expression in nftables, so
> they can e.g. use byte counters or rate estimators etc. to make such
> a decision.  So initial packet always passes via normal stack.
>
> This is also needed to consider e.g. XFRM -- nft_flow_offload.c won't
> offload if the packet has a secpath attached (i.e., will get encrypted
> later).
>
> I suspect we'd want a way to notify/call an ebpf program instead so we
> can avoid the ctnetlink -> userspace -> update dance and do the XDP
> 'flow bypass information update' from inside the kernel and ebpf/XDP
> reimplementation of the nf flow table (it uses the netfilter ingress
> hook on the configured devices; everything it does should be doable
> from XDP).

But the point is exactly that we don't have to duplicate the state into
BPF, we can make XDP look it up directly.
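
To make the four steps above concrete, here is a minimal userspace
model in plain C. It is not an XDP program and does not use any real
kernel API: struct pkt, struct ct_entry, ct_lookup() and handle_packet()
are hypothetical names, the array only stands in for the kernel's
conntrack table, and the verdict enum merely mirrors XDP_PASS and
XDP_REDIRECT. Treat it as a sketch of the control flow, nothing more.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for XDP_PASS / XDP_REDIRECT (hypothetical, userspace only). */
enum verdict { VERDICT_PASS, VERDICT_REDIRECT };

/* Hypothetical parsed 5-tuple; a real program would parse raw headers. */
struct pkt {
    uint8_t  proto;          /* 6 = TCP, 17 = UDP */
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

/* Hypothetical view of one conntrack entry. */
struct ct_entry {
    struct pkt orig;         /* tuple the entry was created for */
    bool       nat;          /* does this flow need DNAT? */
    uint32_t   nat_daddr;
    uint16_t   nat_dport;
    uint64_t   last_seen;    /* refreshed so the entry does not expire */
    int        out_ifindex;  /* where to redirect the packet */
};

/* Linear search standing in for a conntrack lookup. */
static struct ct_entry *ct_lookup(struct ct_entry *tbl, size_t n,
                                  const struct pkt *p)
{
    for (size_t i = 0; i < n; i++) {
        const struct pkt *o = &tbl[i].orig;
        if (o->proto == p->proto && o->saddr == p->saddr &&
            o->daddr == p->daddr && o->sport == p->sport &&
            o->dport == p->dport)
            return &tbl[i];
    }
    return NULL;
}

enum verdict handle_packet(struct ct_entry *tbl, size_t n, struct pkt *p,
                           uint64_t now, int *out_ifindex)
{
    /* 1. Unknown packet type: let the stack deal with it. */
    if (p->proto != 6 && p->proto != 17)
        return VERDICT_PASS;

    /* 2. Look up existing state; no state means a corner case. */
    struct ct_entry *ct = ct_lookup(tbl, n, p);
    if (!ct)
        return VERDICT_PASS;

    /* 3. Rewrite the packet if the flow is NATed. */
    if (ct->nat) {
        p->daddr = ct->nat_daddr;
        p->dport = ct->nat_dport;
    }

    /* 4. Refresh the entry and redirect out the destination interface. */
    ct->last_seen = now;
    *out_ifindex = ct->out_ifindex;
    return VERDICT_REDIRECT;
}
```

In this model, anything without existing state falls back to
VERDICT_PASS, i.e. the stack still creates new entries; creating state
from the fast path is left out on purpose.
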

>> This requires updating of state, but I see no reason why this shouldn't
>> be possible?
>
> Updating ct->status is problematic, there would have to be extra checks
> that prevent non-atomic writes and toggling of special bits such as
> CONFIRMED, TEMPLATE or DYING.  Adding a helper to toggle something
> specific, e.g. the offload state bit, should be okay.

We can certainly constrain the update so it's not possible to get into
an unsafe state. The primary use case is accelerating the common case,
punting to the stack is fine for corner cases.

-Toke
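
As an illustration of the kind of constrained update discussed above,
the following is a small userspace model using C11 atomics. It is only a
sketch under stated assumptions: struct ct_model, ct_set_status_bits()
and the CT_* bit values are hypothetical and are not the kernel's IPS_*
definitions or any existing helper. The idea it models is that the
caller may only set bits from an allow-list (here just the offload bit),
and the write is a single atomic OR.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit layout, NOT the kernel's IPS_* values. */
enum {
    CT_CONFIRMED = 1u << 0,
    CT_DYING     = 1u << 1,
    CT_TEMPLATE  = 1u << 2,
    CT_OFFLOAD   = 1u << 3,
};

/* Bits the fast path is allowed to set in this model. */
#define CT_XDP_WRITABLE ((unsigned long)CT_OFFLOAD)

/* Hypothetical stand-in for the status word of a conntrack entry. */
struct ct_model {
    atomic_ulong status;
};

/*
 * Set status bits, refusing any request that touches a bit outside the
 * allow-list. The update itself is one atomic read-modify-write, so
 * concurrent writers cannot lose each other's bits.
 */
bool ct_set_status_bits(struct ct_model *ct, unsigned long bits)
{
    if (bits & ~CT_XDP_WRITABLE)
        return false;          /* would toggle a special bit: reject */
    atomic_fetch_or(&ct->status, bits);
    return true;
}
```

A rejected request leaves the status word untouched, which corresponds
to punting the corner case to the stack.
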