On Tue, Oct 12, 2021 at 03:28:39PM +0200, Eugene Crosser wrote: > CAUTION: External E-Mail - Use caution with links and attachments > > Date: Tue, 12 Oct 2021 15:28:39 +0200 > From: Eugene Crosser <crosser@xxxxxxxxxxx> > To: netdev@xxxxxxxxxxxxxxx > Cc: netfilter-devel@xxxxxxxxxxxxxxx, Lahav Schlesinger > <lschlesinger@xxxxxxxxxxxxx>, David Ahern <dsahern@xxxxxxxxxx> > Subject: Commit 09e856d54bda5f288ef8437a90ab2b9b3eab83d1r "vrf: Reset skb > conntrack connection on VRF rcv" breaks expected netfilter behaviour > User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 > Thunderbird/78.13.0 > > Hello all, > > > Commit mentioned in the subject was intended to avoid creation of stray > conntrack entries when input interface is enslaved in a VRF, and thus > prerouting conntrack hook is called twice: once in the context of the > original input interface, and once in the context of the VRF interface. > Solution was to nuke conntrack related data associated with the skb when > it enters VRF context. > > > > However this breaks netfilter operation. Imagine a use case when > conntrack zone must be assigned based on the (original, "real") input > interface, rather than VRF interface (that can enslave multiple "real" > interfaces, that would become indistinguishable). One could create > netfilter rules similar to these: > > > > chain rawprerouting { > > type filter hook prerouting priority raw; > > iif realiface1 ct zone set 1 return > > iif realiface2 ct zone set 2 return > > } > > > > This works before the mentioned commit, but not after: zone assignment > is "forgotten", and any subsequent NAT or filtering that is dependent on > the conntrack zone does not work. > > > > There is a reproducer script at the bottom of this message that > demonstrates the difference in behaviour. > > > > Maybe a better solution for stray conntrack entries would be to > introduce finer control in netfilter? One possible idea would be to > implement both "track" and "notrack" targets; then a working > configuration would look like this: > > > > chain rawprerouting { > > type filter hook prerouting priority raw; > > iif realiface1 ct zone set 1 notrack > > iif realiface2 ct zone set 2 notrack > > iif vrfmaster track > > } > > > > so in the original input interface context, zone is assigned, but > conntrack processing itself is shortcircuited. When the packet enters > VRF context, conntracking is reenabled, so one entry is created, in the > zone assigned at an earlier stage. > > > > This is just an idea, I don't have enough knowledge to judge how > workable is it. > > > > For reference, this is a thread about the issue in netfilter-devel: > https://marc.info/?t=163310182600001&r=1&w=2 > > > > Thank you, > > > > Eugene > > > > ========== > > #!/bin/sh > > > > # This script demonstrates unexpected change of nftables behaviour > > # caused by commit 09e856d54bda5f28 ""vrf: Reset skb conntrack > > # connection on VRF rcv" > > # > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=09e856d54bda5f288ef8437a90ab2b9b3eab83d1 > > # > > # Before the commit, it was possible to assign conntrack zone to a > > # packet (or mark it for `notracking`) in the prerouting chanin, raw > > # priority, based on the `iif` (interface from which the packet > > # arrived). > > # After the change, # if the interface is enslaved in a VRF, such > > # assignment is lost. Instead, assignment based on the `iif` matching > > # the VRF master interface is honored. Thus it is impossible to > > # distinguish packets based on the original interface. > > # > > # This script demonstrates this change of behaviour: conntrack zone 1 > > # or 2 is assigned depending on the match with the original interface > > # or the vrf master interface. It can be observed that conntrack entry > > # appears in different zone in the kernel versions before and after > > # the commit. Additionaly, the script produces netfilter trace files > > # that can be used for debugging the issue. > > > > IPIN=172.30.30.1 > > IPOUT=172.30.30.2 > > PFXL=30 > > > > ip li sh vein >/dev/null 2>&1 && ip li del vein > > ip li sh tvrf >/dev/null 2>&1 && ip li del tvrf > > nft list table testct >/dev/null 2>&1 && nft delete table testct > > > > ip li add vein type veth peer veout > > ip li add tvrf type vrf table 9876 > > ip li set veout master tvrf > > ip li set vein up > > ip li set veout up > > ip li set tvrf up > > /sbin/sysctl -w net.ipv4.conf.veout.accept_local=1 > > /sbin/sysctl -w net.ipv4.conf.veout.rp_filter=0 > > ip addr add $IPIN/$PFXL dev vein > > ip addr add $IPOUT/$PFXL dev veout > > > > nft -f - <<__END__ > > table testct { > > chain rawpre { > > type filter hook prerouting priority raw; > > iif { veout, tvrf } meta nftrace set 1 > > iif veout ct zone set 1 return > > iif tvrf ct zone set 2 return > > notrack > > } > > chain rawout { > > type filter hook output priority raw; > > notrack > > } > > } > > __END__ > > > > uname -rv > > conntrack -F > > stdbuf -o0 nft monitor trace >nftrace.`uname -r`.txt & > > monpid=$! > > ping -W 1 -c 1 -I vein $IPOUT > > conntrack -L > > sleep 1 > > kill -15 $monpid > > wait > Hi Eugene, I apologie for the late response (was on vacation). The call to nf_reset_ct() I added was to match the existing call in the egress flow, which I didn't want to change in order to not break existing behaviour (which I unintentionally still did :-)). Seems like any combination of calling nf_reset_ct() will lead to something breaking. So continuing on what Florian suggested, another possibility is to make the calls to nf_reset_ct() in both ingress and egress flow configurable (procfs or new flags to RTM_NEWLINK). One benefit of this is that disabling nf_reset_ct() on the egress flow will mean no port SNAT will take place when SNAT rule is installed on a VRF (as I described in my original commit), which can break applications that depend on using a specific source port. I'll need to think about it some more though, I don't have any better solution right now.