Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook

Lukas Wunner <lukas@xxxxxxxxx> · Sun, 11 Oct 2020 09:59:05 +0200

On Tue, Sep 08, 2020 at 02:55:36PM +0200, Daniel Borkmann wrote:
> I would strongly prefer something where nf integrates into existing
> tc hook, not only due to the hook reuse which would be better,
> but also to allow for a more flexible interaction between tc/BPF
> use cases
[...]
> one option to move forward [...] overall rework of ingress/egress
> side to be a more flexible pipeline (think of cont/ok actions
> as with tc filters or stackable LSMs to process & delegate).

Interaction between netfilter and tc is facilitated by skb->mark.
Both netfilter and tc are able to set and match by way of the mark.
E.g. a netfilter hook may set the mark and tc may later perform an
action if a matching mark is found.

Because the placement of netfilter and tc hooks in the data path
has been unchanged for decades, we must assume that users depend
on their order for setting and matching the mark.

Thus, reworking the data path in the way you suggest (a flexible
pipeline) must not change the order of the hooks.  It would have
to be a fixed pipeline.  But what's the benefit then compared to
separate netfilter and tc hooks which are patched in at runtime
and become NOPs if not used?  (Which is what the present series is
aiming for.)

> to name one example... consider two different entities in the system
> setting up the two, that is, one adding rules for nf ingress/egress
> on the phys device for host fw and the other one for routing traffic
> into/from containers at the tc layer, then traffic going into host ns
> will hit nf ingress and on egress side the nf egress part; however,
> traffic going to containers via existing tc redirect will not see the
> nf ingress as expected but would on reverse path incorrectly
> hit the nf egress one which is /not/ the case for dev_queue_xmit() today.

Using tc to bounce ingress traffic into a container -- is that actually
a thing or is it a hypothetical example?  I think at least Docker uses
plain vanilla routing and bridging to move packets in and out of
containers.

However you're right that if tc *is* used to redirect ingress packets
to a container veth, then the data path would look like:

host tc -> container tc -> container nft

Whereas the egress data path would look like:

container nft -> container tc -> host nft -> host tc

But I argue that the egress data path is actually correct because the
host must be able to firewall packets coming out of the container
in case the container has been compromised.

> And if you check a typical DHCP client that is present on major
> modern distros like systemd-networkd's DHCP client then they
> already implement filtering of malicious packets via BPF at
> socket layer including checking for cookies in the DHCP header
> that are set by the application itself to prevent spoofing [0].
> 
> [0] https://github.com/systemd/systemd/blob/master/src/libsystemd-network/dhcp-network.c#L28

That's an *ingress* filter so that user space only receives DHCP
packets and nothing else.

We're talking about the ability to filter *egress* DHCP packets
(among others) at the kernel level to guard against unwanted
packets coming from user space.

Thanks,

Lukas