On 1/22/21 9:47 AM, Lukas Wunner wrote:
Commit e687ad60af09 ("netfilter: add netfilter ingress hook after handle_ing() under unique static key") introduced the ability to classify packets with netfilter on ingress. Support the same on egress to satisfy user requirements such as: * outbound security policies for containers (Laura) * filtering and mangling intra-node Direct Server Return (DSR) traffic on a load balancer (Laura) * filtering locally generated traffic coming in through AF_PACKET, such as local ARP traffic generated for clustering purposes or DHCP (Laura; the AF_PACKET plumbing is contained in a separate commit) * L2 filtering from ingress and egress for AVB (Audio Video Bridging) and gPTP with nftables (Pablo) * in the future: in-kernel NAT64/NAT46 (Pablo) A patch for nftables to hook up egress rules from user space has been submitted separately, so users may immediately take advantage of the feature. The hook is positioned after packet handling by traffic control. Thus, if packets are redirected into and out of containers with tc, the data path is: ingress: host tc -> container tc -> container nft egress: container tc -> host tc -> host nft This was done to address an objection from Daniel Borkmann: If desired, nft does not get into tc's way performance-wise. The host is able to firewall malicious packets coming out of a container, but only after tc has done its duty. An implication is that tc may set skb->mark on egress for nft to act on it, but not the other way round. If egress netfilter handling is not enabled on any interface, it is patched out of the data path by way of a static_key and doesn't make a performance difference that is discernible from noise:
[...]
@@ -4096,13 +4098,18 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev) qdisc_pkt_len_init(skb); #ifdef CONFIG_NET_CLS_ACT skb->tc_at_ingress = 0; -# ifdef CONFIG_NET_EGRESS +#endif +#ifdef CONFIG_NET_EGRESS if (static_branch_unlikely(&egress_needed_key)) { skb = sch_handle_egress(skb, &rc, dev); if (!skb) goto out; + if (nf_hook_egress_active()) { + skb = nf_hook_egress(skb, &rc, dev); + if (!skb) + goto out; + }
This won't work unfortunately, the issue is that this now creates an asymmetric path, for example: [tc ingress] -> [nf ingress] -> [host] -> [tc egress] -> [nf egress]. So any NAT translation done on tc ingress + tc egress will break on the nf hooks given nf is not able to correlate inbound with outbound traffic. Likewise for container-based traffic that is in its own netns, one of the existing paths is: [tc ingress (phys,hostns)] -> [tc ingress (veth,podns)] -> [reply] -> [tc egress (veth,podns)] -> [tc ingress (veth,hostns)] -> [nf egress (phys,hostns)*] -> [tc egress (phys,hostns)]. As can be seen the [nf ingress] hook is not an issue at all given everything operates in tc layer but the [nf egress*] one is in this case, as it touches in tc layer where existing data planes will start to break upon rule injection. Wrt above objection, what needs to be done for the minimum case is to i) fix the asymmetry problem from here, and ii) add a flag for tc layer-redirected skbs to skip the nf egress hook *; with that in place this concern should be resolved. Thanks!
} -# endif #endif /* If device/qdisc don't need skb->dst, release it right now while * its hot in this cpu cache.