On 19/10/2021 13:49, Florian Westphal wrote:
> David Ahern <dsahern@xxxxxxxxx> wrote:
>> Thanks for the detailed summary and possible solutions.
>>
>> NAT/MASQ rules with VRF were not really thought about during
>> development; it was not a use case (or use cases) Cumulus or other NOS
>> vendors cared about. Community users were popping up fairly early and
>> patches would get sent, but no real thought about how to handle both
>> sets of rules - VRF device and port devices.
>>
>> What about adding an attribute on the VRF device to declare which side
>> to take -- rules against the port device or rules against the VRF device
>> and control the nf resets based on it?
>
> This would need a way to suppress the NF_HOOK invocation from the
> normal IP path. Any idea on how to do that? AFAICS there is no way to
> get to the vrf device at that point, so no way to detect the toggle.
>
> Or did you mean to only suppress the 2nd conntrack round?
>
> For packets that get forwarded we'd always need to run those in the vrf
> context, afaics, because doing an nf_reset() may create a new conntrack
> entry (if flow has DNAT, then incoming address has been reversed
> already, so it won't match existing REPLY entry in the conntrack table anymore).
>
> For locally generated packets, we could skip conntrack for VRF context
> via 'skb->_nfct = UNTRACKED' + nf_reset_ct before xmit to lower device,
> and for lower device by eliding the reset entirely.

I think that I have SNAT (at least) working fine with VRFs, without the
commit. What I do is set notrack at the vrf prerouting callback. Could
that be the "proper" way to go? I don't know if I am breaking anything
else, though.

Here is my reproducer script. SNAT works on kernels without the "reset
conntrack" commit.

(Sorry my Thunderbird inserts extra newlines :( )

========
#!/bin/sudo /bin/bash

for i in 1 2; do
    ip li sh src$i >/dev/null 2>&1 && ip li set src$i nomaster \
        && ip li del src$i
    ip li sh sink$i >/dev/null 2>&1 && ip li del sink$i
    ip li sh vrf$i >/dev/null 2>&1 && ip li del vrf$i
    ip r flush table 10$i
    ip netns list | grep -q ns$i && ip netns del ns$i
done
nft list table testnat >/dev/null 2>&1 && nft delete table testnat

case $1 in
clean) echo "cleaned up"; exit 0;;
esac

sysctl -w net.ipv4.ip_forward=1

for i in 1 2; do
    ip netns add ns$i
    ip netns exec ns$i ip li set lo up
    ip li add vrf$i type vrf table 10$i
    ip r add vrf vrf$i unreachable default metric 4278198272
    ip li add src$i type veth peer wayout netns ns$i
    ip li set src$i master vrf$i
    ip a add 172.31.$i.1/32 dev src$i
    ip li set src$i up
    ip li set vrf$i up
    #/sbin/sysctl -w net.ipv4.conf.src$i.accept_local=1
    ip netns exec ns$i ip a add 172.31.$i.2/24 dev wayout
    ip netns exec ns$i ip li set wayout up
    ip netns exec ns$i ip r add default via 172.31.$i.1
    ip li add sink$i type veth peer wayin netns ns$i
    ip netns exec ns$i ip li set wayin up
    ip netns exec ns$i /sbin/sysctl -w net.ipv4.conf.wayin.rp_filter=0
    ip li set sink$i up
done
ip r add 172.31.1.0/24 dev sink1 table 102
ip r add 172.31.2.0/24 dev sink2 table 101
ip r add 100.64.0.0/24 dev sink1 table 102

nft -f - <<__END__
table testnat {
    chain rawpre {
        type filter hook prerouting priority raw;
        #iif { src1, src2 } meta nftrace set 1
        iif { src1, src2 } ct zone set 1 return
        notrack
    }
    chain rawout {
        type filter hook output priority raw;
        notrack
    }
    chain natpost {
        type nat hook postrouting priority srcnat;
        oif sink2 snat ip to 100.64.0.2
    }
}
__END__

conntrack -F
ip netns exec ns2 tcpdump -lni wayin arp or icmp &
tdpid=$!
sleep 1
ip netns exec ns1 ping -W 1 -c 1 172.31.2.2
conntrack -L
sleep 1
kill $tdpid
wait
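
To check the result without reading the conntrack table by eye, the reply-direction destination can be pulled out of the `conntrack -L` output. This is only a minimal sketch: it assumes the usual conntrack-tools line layout, in which the second `dst=` field belongs to the reply tuple, and the helper name `snat_reply_dst` is hypothetical and not part of the script.

```shell
# Hypothetical helper (not part of the reproducer): print the reply-direction
# dst= address of every conntrack entry read from stdin.
# Assumption: the second "dst=" token on a line belongs to the reply tuple.
snat_reply_dst() {
    awk '{
        n = 0
        for (i = 1; i <= NF; i++)
            if ($i ~ /^dst=/ && ++n == 2) {
                print substr($i, 5)   # strip the leading "dst="
                break
            }
    }'
}

# Usage (as root, right after the ping in the reproducer):
#   conntrack -L 2>/dev/null | snat_reply_dst
```

If the `natpost` rule took effect, the ICMP entry would be expected to show 100.64.0.2 here rather than the original 172.31.1.2 source.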