Re: [PATCH nf-next v4 4/5] netfilter: Introduce egress hook

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Lukas,

On 9/11/21 11:26 PM, Lukas Wunner wrote:
On Tue, Jan 26, 2021 at 08:13:19PM +0100, Daniel Borkmann wrote:
On 1/22/21 9:47 AM, Lukas Wunner wrote:
[...]
Wrt above objection, what needs to be done for the minimum case is to
i) fix the asymmetry problem from here, and
ii) add a flag for tc layer-redirected skbs to skip the nf egress hook *;
with that in place this concern should be resolved. Thanks!

Daniel, your reply above has left me utterly confused.  After thinking
about it for a while, I'm requesting clarification what you really want.
We do want the netfilter egress hook in mainline and we're happy to
accommodate to your requirements, but we need you to specify them clearly.

Regarding the data path for packets which are going out from a container,
I think you've erroneously mixed up the last two elements above:
If the nf egress hook is placed after tc egress (as is done in the present
patch), the data path is actually as follows:

   [tc egress (veth,podns)] -> [tc ingress (veth,hostns)] ->
   [tc egress (phys,hostns)] -> [nf egress (phys,hostns)*]

By contrast, if nf egress is put in front of tc egress (as you're
requesting above), the data path becomes:

   [nf egress (veth,podns)] -> [tc egress (veth,podns)] ->
   [tc ingress (veth,hostns)] ->
   [nf egress (phys,hostns)*] -> [tc egress (phys,hostns)]

So this order results in an *additional* nf egress hook in the data path,
which may cost performance.  Previously you voiced concerns that the
nf egress hook may degrade performance.  To address that concern,
we ordered nf egress *after* tc egress, thereby avoiding any performance
impact as much as possible.

I agree with your argument that an inverse order vis-a-vis ingress is
more logical because it avoids NAT incongruencies etc.  So I'll be happy
to reorder nf egress before tc egress.  I'm just confused that you're now
requesting an order which may be *more* detrimental to performance.

Right, the issue is that placing it either in front or after the existing tc
egress hook just as-is results in layering violations given both tc/nf subsystems
will inevitably step on each other when both dataplanes are used by different
components where things break. To provide a walk-through on what I think would
work w/o breaking layers:

1) Packet going up hostns stack. Here, we'll end up with:

   [tc ingress (phys,hostns)] -> [nf ingress (phys,hostns)] ->
     upper stack ->
       [nf egress (phys,hostns)] -> [tc egress (phys,hostns)]

2) Packet going up podns stack. Here, if the forwarding happens entirely in tc
   layer, then the datapath (as-is today) operates _below_ nf and must look like:

   [tc ingress (phys,hostns)] -> [tc egress (veth,hostns)] ->
     netns switch -> (*) -> netns switch ->
       [tc ingress (veth,hostns)] -> [tc egress (phys,hostns)]

   For simplicity, I left out (*), but it's essentially the same as case 1)
   just for the Pod's/container's stack PoV.

3) Packet is locally forwarded between Pods. Same as 2) for the case where the
   forwarding happens entirely in tc (as-is today) which operates _below_ nf and
   must look like:

   (+) -> [tc ingress (veth,hostns)] -> [tc egress (veth,hostns)] ->
     netns switch -> (*) -> netns switch ->
       [tc ingress (veth,hostns)] -> [tc egress (veth,hostns)] -> (+)

   The (+) denotes the src/sink where we enter/exit the hostns after netns switch.

The skb bit marker would be that tc [& BPF]-related redirect functions are internally
setting that bit, so that nf egress hook is skipped for cases like 2)/3).

Walking through a similar 1/2/3) scenario from nf side one layer higher if you were
to do an equivalent, things would look like:

1) Packet going up hostns stack. Same as above:

   [tc ingress (phys,hostns)] -> [nf ingress (phys,hostns)] ->
     upper stack ->
       [nf egress (phys,hostns)] -> [tc egress (phys,hostns)]

2) Packet going up podns stack with forwarding from nf side:

   [tc ingress (phys,hostns)] -> [nf ingress (phys,hostns)] ->
     [nf egress (veth,hostns)] -> [tc egress (veth,hostns)] ->
       netns switch -> (*) -> netns switch ->
         [tc ingress (veth,hostns)] -> [nf ingress (veth,hostns)] ->
           [nf egress (phys,hostns)] -> [tc egress (phys,hostns)]

3) Packet is locally forwarded between Pods with forwarding from nf side:

   (+) -> [tc ingress (veth,hostns)] -> [nf ingress (veth,hostns)] ->
     [nf egress (veth,hostns)] -> [tc egress (veth,hostns)] ->
       netns switch -> (*) -> netns switch ->
         [tc ingress (veth,hostns)] -> [nf ingress (veth,hostns)] ->
           [nf egress (veth,hostns)] -> [tc egress (veth,hostns)] -> (+)

Thanks,
Daniel



[Index of Archives]     [Netfitler Users]     [Berkeley Packet Filter]     [LARTC]     [Bugtraq]     [Yosemite Forum]

  Powered by Linux