Re: Commit 09e856d54bda5f288ef8437a90ab2b9b3eab83d1r "vrf: Reset skb conntrack connection on VRF rcv" breaks expected netfilter behaviour

David Ahern <dsahern@xxxxxxxxx> · Mon, 18 Oct 2021 12:14:25 -0600

On 10/18/21 8:34 AM, Florian Westphal wrote:
> David Ahern <dsahern@xxxxxxxxx> wrote:
> 
> Hi David
> 
> TL;DR:
> What is your take on the correct/expected behaviour with vrf+conntrack+nat?
> 
>> Thanks for jumping in, Florian.
> 
> NP.
> 
> Sorry for the wall of text below.
> 
> You can fast-forward to 'Possible solutions' if you want, but they key
> question for me at the moment is the one above.
> 
> I've just submitted a selftest patch to nf.git that adds test
> cases for the problem reported by Eugene + two masquerade/snat test
> cases.
> 
> With net/net-next, first test fails and the other two work, after
> revert its vice versa.
> 
> To get all three working there are a couple of possible solutions to
> this but so far I don't have anything that is void of side effects.
> 
> It assumes revert of the problematic commit, i.e. no nf_reset_ct in
> ingress path from VRF driver.
> 
> First, a summary of VRF+nf+conntrack interaction and where problems are.
> 
> The VRF driver invokes netfilter for output+postrouting hooks,
> with outdev set to the vrf device. Afterwards, ip stack calls those
> hooks again, with outdev set to lower device.
> 
> This is a problem when conntrack is used with IP masquerading.
> With all nf_reset_ct() in vrf driver removed following will happen:
> 
> 1. Conntrack only, no nat, locally generated traffic.
> 
> vrf driver calls output/postrouting hooks.
> output creates new conntrack object and attaches it to skb.
> postrouting confirms entry and places it into state table.
> 
> When hooks are called a second time by IP stack, no new conntrack lookup is
> done because skb already has one attached to it.
> 
> 2. When SNAT is used, this works as well, second iteration doesn't
>    do connection tracking and re-uses nat settings done in iteration 1.
> 
> However there are caveats:
> a) NAT rules that use something like '-o eth0' won't have any effect.
> b) IP address matching in round 2 is 'broken', as the second round deals
> with a rewritten skb (iph saddr already updated).
> 
> This is because when the hooks are invoked a second time, there already
> is a NAT binding attached to the entry. This happens regardless of a
> matching '-o vrfdevice' nat rule or not; when first round did not match
> a nat rule, nat engine attaches a 'nat null binding'.
> 
> 3) When Masquerade is used, things don't work at all.
> 
> This is because of nf_nat_oif_changed() triggering errnously in the nat
> engine.  When masquerade is hit, the output interface index gets stored
> in the conntrack entry.  When the interface index changes, its assumed
> that a routing change took place and the connection has been broken
> (masquerade picks saddr based on the output interface).
> 
> Therefore, NF_DROP gets returned.
> 
> In VRF case, this triggers because we see the skb twice, once with
> ifindex == vrf and once with lower/physdev.
> 
> I suspect that this is what lead eb63ecc1706b3e094d0f57438b6c2067cfc299f2
> (net: vrf: Drop conntrack data after pass through VRF device on Tx),
> this added nf_reset() calls to the tx path.
> 
> This changes things as follows:
> 
> 1. Conntrack only, no nat:
> same as before, expect that the second round does a new conntrack lookup.
> It will find the entry created by first iteration and use that.
> 
> 2. With nat:
> If first round has no matching nat rule, things work:
> Second round will find existing entry and use it.
> NAT rules on second iteration have no effect, just like before.
> 
> If first round had a matching nat rule, then the packet gets rewritten.
> This means that the second round will NOT find an existing entry, and
> conntrack tracks the flow a second time, this time with the post-nat
> source address.
> 
> Because of this, conntrack will also detect a tuple collision when it
> tries to insert the 'new flow incarnation', because the reply direction
> of the 'first round flow' collides with the original direction of the
> second iteration. This forces allocation of a new source port, so source
> port translation is done.
> 
> This in turn breaks in reverse direction, because incoming/reply packet
> only hits the connection tracking engine once, i.e. only the source
> port translation is reversed.
> 
> To fix this, Lahav also added nf_reset_ct() to ingress processing to
> undo the first round nat transformation as well.
> 
> ... and that in turn breaks 'notrack', 'mark' or 'ct zone' assignments
> done based on the incoming/lower device -- the nf_reset_ct zaps those
> settings before round 2 so they have no effect anymore.
> 
> Possible solutions
> ==================
> 
> Taking a few steps back and ignoring 'backwards compat' for now, I think
> that conntrack should process the flow only once.  VRF doesn't transform the
> packets in any way and the only reason for the extra NF_HOOK() calls appear to
> be so that users can match on the incoming/outgoing vrf interface.
> 
> a)
> For locally generated packets, the most simple fix would be to mark
> skb->nfct as 'untracked', and clear it again instead of nf_reset_ct().
> 
> This avoids the need to change anyting on conntrack/nat side.
> The downside is that it becomes impossible to add nat mappings based
> on '-o vrf', because conntrack gets bypassed in round 1.
> 
> For forwarding case (where OUTPUT hooks are not hit and
> ingress path has attached skb->nfct), we would need to find a different
> way to bypass conntrack.  One solution (least-LOC) is to nf_reset() and
> then mark skb as untracked.  IP(6)CB should have FORWARD flag set that
> can be used to detect this condition.
> 
> b)
> make the nf_ct_reset calls in vrf tx path conditional.
> Its possible to detect when a NAT rule was hit via ct->status bits.
> 
> When the NF_HOOK() calls invoked via VRF found a SNAT/MASQ rule,
> don't reset the conntrack entry.
> 
> Downside 1: the second invocation is done with 'incorrect' ip source
> address, OTOH that already happens at this time.
> 
> Downside 2: We need to alter conntrack/nat to avoid the 'oif has
> changed' logic from kicking in.
> 
> I don't see other solutions at the moment.
> 
> For INPUT, users can also match the lower device via inet_sdif()
> (original ifindex stored in IP(6)CB), but we don't have that
> for output, and its not easy to add something because IPCB isn't
> preserved between rounds 1 & 2.
> 

Thanks for the detailed summary and possible solutions.

NAT/MASQ rules with VRF were not really thought about during
development; it was not a use case (or use cases) Cumulus or other NOS
vendors cared about. Community users were popping up fairly early and
patches would get sent, but no real thought about how to handle both
sets of rules - VRF device and port devices.

What about adding an attribute on the VRF device to declare which side
to take -- rules against the port device or rules against the VRF device
and control the nf resets based on it?