Re: Commit 09e856d54bda5f288ef8437a90ab2b9b3eab83d1r "vrf: Reset skb conntrack connection on VRF rcv" breaks expected netfilter behaviour

Florian Westphal <fw@xxxxxxxxx> · Mon, 18 Oct 2021 16:34:30 +0200

David Ahern <dsahern@xxxxxxxxx> wrote:

Hi David

TL;DR:
What is your take on the correct/expected behaviour with vrf+conntrack+nat?

> Thanks for jumping in, Florian.

NP.

Sorry for the wall of text below.

You can fast-forward to 'Possible solutions' if you want, but they key
question for me at the moment is the one above.

I've just submitted a selftest patch to nf.git that adds test
cases for the problem reported by Eugene + two masquerade/snat test
cases.

With net/net-next, first test fails and the other two work, after
revert its vice versa.

To get all three working there are a couple of possible solutions to
this but so far I don't have anything that is void of side effects.

It assumes revert of the problematic commit, i.e. no nf_reset_ct in
ingress path from VRF driver.

First, a summary of VRF+nf+conntrack interaction and where problems are.

The VRF driver invokes netfilter for output+postrouting hooks,
with outdev set to the vrf device. Afterwards, ip stack calls those
hooks again, with outdev set to lower device.

This is a problem when conntrack is used with IP masquerading.
With all nf_reset_ct() in vrf driver removed following will happen:

1. Conntrack only, no nat, locally generated traffic.

vrf driver calls output/postrouting hooks.
output creates new conntrack object and attaches it to skb.
postrouting confirms entry and places it into state table.

When hooks are called a second time by IP stack, no new conntrack lookup is
done because skb already has one attached to it.

2. When SNAT is used, this works as well, second iteration doesn't
   do connection tracking and re-uses nat settings done in iteration 1.

However there are caveats:
a) NAT rules that use something like '-o eth0' won't have any effect.
b) IP address matching in round 2 is 'broken', as the second round deals
with a rewritten skb (iph saddr already updated).

This is because when the hooks are invoked a second time, there already
is a NAT binding attached to the entry. This happens regardless of a
matching '-o vrfdevice' nat rule or not; when first round did not match
a nat rule, nat engine attaches a 'nat null binding'.

3) When Masquerade is used, things don't work at all.

This is because of nf_nat_oif_changed() triggering errnously in the nat
engine.  When masquerade is hit, the output interface index gets stored
in the conntrack entry.  When the interface index changes, its assumed
that a routing change took place and the connection has been broken
(masquerade picks saddr based on the output interface).

Therefore, NF_DROP gets returned.

In VRF case, this triggers because we see the skb twice, once with
ifindex == vrf and once with lower/physdev.

I suspect that this is what lead eb63ecc1706b3e094d0f57438b6c2067cfc299f2
(net: vrf: Drop conntrack data after pass through VRF device on Tx),
this added nf_reset() calls to the tx path.

This changes things as follows:

1. Conntrack only, no nat:
same as before, expect that the second round does a new conntrack lookup.
It will find the entry created by first iteration and use that.

2. With nat:
If first round has no matching nat rule, things work:
Second round will find existing entry and use it.
NAT rules on second iteration have no effect, just like before.

If first round had a matching nat rule, then the packet gets rewritten.
This means that the second round will NOT find an existing entry, and
conntrack tracks the flow a second time, this time with the post-nat
source address.

Because of this, conntrack will also detect a tuple collision when it
tries to insert the 'new flow incarnation', because the reply direction
of the 'first round flow' collides with the original direction of the
second iteration. This forces allocation of a new source port, so source
port translation is done.

This in turn breaks in reverse direction, because incoming/reply packet
only hits the connection tracking engine once, i.e. only the source
port translation is reversed.

To fix this, Lahav also added nf_reset_ct() to ingress processing to
undo the first round nat transformation as well.

... and that in turn breaks 'notrack', 'mark' or 'ct zone' assignments
done based on the incoming/lower device -- the nf_reset_ct zaps those
settings before round 2 so they have no effect anymore.

Possible solutions
==================

Taking a few steps back and ignoring 'backwards compat' for now, I think
that conntrack should process the flow only once.  VRF doesn't transform the
packets in any way and the only reason for the extra NF_HOOK() calls appear to
be so that users can match on the incoming/outgoing vrf interface.

a)
For locally generated packets, the most simple fix would be to mark
skb->nfct as 'untracked', and clear it again instead of nf_reset_ct().

This avoids the need to change anyting on conntrack/nat side.
The downside is that it becomes impossible to add nat mappings based
on '-o vrf', because conntrack gets bypassed in round 1.

For forwarding case (where OUTPUT hooks are not hit and
ingress path has attached skb->nfct), we would need to find a different
way to bypass conntrack.  One solution (least-LOC) is to nf_reset() and
then mark skb as untracked.  IP(6)CB should have FORWARD flag set that
can be used to detect this condition.

b)
make the nf_ct_reset calls in vrf tx path conditional.
Its possible to detect when a NAT rule was hit via ct->status bits.

When the NF_HOOK() calls invoked via VRF found a SNAT/MASQ rule,
don't reset the conntrack entry.

Downside 1: the second invocation is done with 'incorrect' ip source
address, OTOH that already happens at this time.

Downside 2: We need to alter conntrack/nat to avoid the 'oif has
changed' logic from kicking in.

I don't see other solutions at the moment.

For INPUT, users can also match the lower device via inet_sdif()
(original ifindex stored in IP(6)CB), but we don't have that
for output, and its not easy to add something because IPCB isn't
preserved between rounds 1 & 2.