On Tue, Sep 08, 2020 at 02:55:36PM +0200, Daniel Borkmann wrote:
> I would strongly prefer something where nf integrates into existing
> tc hook, not only due to the hook reuse which would be better,
> but also to allow for a more flexible interaction between tc/BPF
> use cases [...]
> one option to move forward [...] overall rework of ingress/egress
> side to be a more flexible pipeline (think of cont/ok actions
> as with tc filters or stackable LSMs to process & delegate).

Interaction between netfilter and tc is facilitated by skb->mark:
Both netfilter and tc are able to set the mark as well as match on it.
E.g. a netfilter hook may set the mark and tc may later perform an
action if a matching mark is found.

Because the placement of the netfilter and tc hooks in the data path
has been unchanged for decades, we must assume that users depend on
their order for setting and matching the mark.  Thus, reworking the
data path in the way you suggest (a flexible pipeline) must not change
the order of the hooks.  It would have to be a fixed pipeline.  But
what would be the benefit then, compared to separate netfilter and tc
hooks which are patched in at runtime and become NOPs if not used?
(Which is what the present series is aiming for.)

> to name one example... consider two different entities in the system
> setting up the two, that is, one adding rules for nf ingress/egress
> on the phys device for host fw and the other one for routing traffic
> into/from containers at the tc layer, then traffic going into host ns
> will hit nf ingress and on egress side the nf egress part; however,
> traffic going to containers via existing tc redirect will not see the
> nf ingress as expected but would on reverse path incorrectly
> hit the nf egress one which is /not/ the case for dev_queue_xmit() today.

Using tc to bounce ingress traffic into a container -- is that actually
done in practice or is it a hypothetical example?  I think at least
Docker uses plain vanilla routing and bridging to move packets in and
out of containers.

However you're right that if tc *is* used to redirect ingress packets
to a container veth, the ingress data path would look like:

  host tc -> container tc -> container nft

Whereas the egress data path would look like:

  container nft -> container tc -> host nft -> host tc

But I'd argue that this egress data path is actually correct:  The host
must be able to firewall packets coming out of the container in case
the container has been compromised.

> And if you check a typical DHCP client that is present on major
> modern distros like systemd-networkd's DHCP client then they
> already implement filtering of malicious packets via BPF at
> socket layer including checking for cookies in the DHCP header
> that are set by the application itself to prevent spoofing [0].
>
> [0] https://github.com/systemd/systemd/blob/master/src/libsystemd-network/dhcp-network.c#L28

That's an *ingress* filter which ensures that user space only receives
DHCP packets and nothing else.  What we're talking about here is the
ability to filter *egress* DHCP packets (among others) at the kernel
level, to guard against unwanted packets coming *from* user space.

Thanks,

Lukas
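
P.S.: To make the mark interplay described above concrete, here's a
minimal sketch.  The device name (eth0), mark value (1) and table/chain
names are all made up:

  # netfilter sets a mark on outbound HTTP packets in its output hook ...
  nft add table ip filter
  nft add chain ip filter output '{ type filter hook output priority 0; }'
  nft add rule ip filter output tcp dport 80 meta mark set 1

  # ... and tc, whose egress hook runs after netfilter's output hook,
  # matches on that mark with the fw classifier and drops the packet
  tc qdisc add dev eth0 clsact
  tc filter add dev eth0 egress protocol ip handle 1 fw action drop

If a data path rework ever swapped the order of the two hooks, the fw
filter would no longer see the mark, hence my point above.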
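
The tc redirect into a container which you describe would be something
like the following, with veth0 as a made-up host-side container
interface and 10.0.0.2 as a made-up container address:

  # bounce packets for the container's address from the phys device
  # straight to its veth, bypassing the host's nf ingress hook
  tc qdisc add dev eth0 clsact
  tc filter add dev eth0 ingress protocol ip flower dst_ip 10.0.0.2 \
      action mirred egress redirect dev veth0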
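
And filtering egress DHCP packets at the kernel level would look
roughly like this with the egress hook proposed by the present series
(syntax as proposed here, so take it with a grain of salt):

  # drop DHCP client->server packets leaving eth0, no matter which
  # application or raw socket generated them
  nft add table netdev filter
  nft add chain netdev filter egress '{ type filter hook egress device eth0 priority 0; }'
  nft add rule netdev filter egress udp sport 68 udp dport 67 drop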