Re: [PATCH v6] net: ipv4: Cache pmtu for all packet paths if multipath enabled

"Vladimir Vdovin" <deliran@xxxxxxxxxx> · Mon, 04 Nov 2024 10:30:41 +0300

Many thanks for your comments! Sent new patch with proper fixes

On Sun Nov 3, 2024 at 4:05 PM MSK, Ido Schimmel wrote:
> Subject prefix should be "[PATCH net-next v6]" (v7 for next version,
> obviously)
>
> On Sat, Nov 02, 2024 at 04:29:35PM +0000, Vladimir Vdovin wrote:
> > Check number of paths by fib_info_num_path(),
> > and update_or_create_fnhe() for every path.
> > Problem is that pmtu is cached only for the oif
> > that has received icmp message "need to frag",
> > other oifs will still try to use "default" iface mtu.
> > 
> > An example topology showing the problem:
> > 
> >                     |  host1
> >                 +---------+
> >                 |  dummy0 | 10.179.20.18/32  mtu9000
> >                 +---------+
> >         +-----------+----------------+
> >     +---------+                     +---------+
> >     | ens17f0 |  10.179.2.141/31    | ens17f1 |  10.179.2.13/31
> >     +---------+                     +---------+
> >         |    (all here have mtu 9000)    |
> >     +------+                         +------+
> >     | ro1  |  10.179.2.140/31        | ro2  |  10.179.2.12/31
> >     +------+                         +------+
> >         |                                |
> > ---------+------------+-------------------+------
> >                         |
> >                     +-----+
> >                     | ro3 | 10.10.10.10  mtu1500
> >                     +-----+
> >                         |
> >     ========================================
> >                 some networks
> >     ========================================
> >                         |
> >                     +-----+
> >                     | eth0| 10.10.30.30  mtu9000
> >                     +-----+
> >                         |  host2
> > 
> > host1 have enabled multipath and
> > sysctl net.ipv4.fib_multipath_hash_policy = 1:
> > 
> > default proto static src 10.179.20.18
> >         nexthop via 10.179.2.12 dev ens17f1 weight 1
> >         nexthop via 10.179.2.140 dev ens17f0 weight 1
> > 
> > When host1 tries to do pmtud from 10.179.20.18/32 to host2,
> > host1 receives at ens17f1 iface an icmp packet from ro3 that ro3 mtu=1500.
> > And host1 caches it in nexthop exceptions cache.
> > 
> > Problem is that it is cached only for the iface that has received icmp,
> > and there is no way that ro3 will send icmp msg to host1 via another path.
> > 
> > Host1 now have this routes to host2:
> > 
> > ip r g 10.10.30.30 sport 30000 dport 443
> > 10.10.30.30 via 10.179.2.12 dev ens17f1 src 10.179.20.18 uid 0
> >     cache expires 521sec mtu 1500
> > 
> > ip r g 10.10.30.30 sport 30033 dport 443
> > 10.10.30.30 via 10.179.2.140 dev ens17f0 src 10.179.20.18 uid 0
> >     cache
> > 
> > So when host1 tries again to reach host2 with mtu>1500,
> > if packet flow is lucky enough to be hashed with oif=ens17f1 its ok,
> > if oif=ens17f0 it blackholes and still gets icmp msgs from ro3 to ens17f1,
> > until lucky day when ro3 will send it through another flow to ens17f0.
> > 
> > Signed-off-by: Vladimir Vdovin <deliran@xxxxxxxxxx>
> > ---
> > 
> > V6:
> >   - make commit message cleaner
> > 
> > V5:
> >   - make self test cleaner
> > 
> > V4:
> >   - fix selftest, do route lookup before checking cached exceptions
> > 
> > V3:
> >   - added selftest
> >   - fixed compile error
> > 
> > V2:
> >   - fix fib_info_num_path parameter pass
> > ---
> >  net/ipv4/route.c                    | 13 +++++
> >  tools/testing/selftests/net/pmtu.sh | 78 ++++++++++++++++++++++++++++-
> >  2 files changed, 90 insertions(+), 1 deletion(-)
> > 
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index 723ac9181558..41162b5cc4cb 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -1027,6 +1027,19 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
> >  		struct fib_nh_common *nhc;
> >  
> >  		fib_select_path(net, &res, fl4, NULL);
> > +#ifdef CONFIG_IP_ROUTE_MULTIPATH
> > +		if (fib_info_num_path(res.fi) > 1) {
> > +			int nhsel;
> > +
> > +			for (nhsel = 0; nhsel < fib_info_num_path(res.fi); nhsel++) {
> > +				nhc = fib_info_nhc(res.fi, nhsel);
> > +				update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
> > +							  jiffies + net->ipv4.ip_rt_mtu_expires);
> > +			}
> > +			rcu_read_unlock();
> > +			return;
> > +		}
> > +#endif /* CONFIG_IP_ROUTE_MULTIPATH */
> >  		nhc = FIB_RES_NHC(res);
> >  		update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
> >  				      jiffies + net->ipv4.ip_rt_mtu_expires);
> > diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
> > index 569bce8b6383..a0159340fe84 100755
> > --- a/tools/testing/selftests/net/pmtu.sh
> > +++ b/tools/testing/selftests/net/pmtu.sh
> > @@ -266,7 +266,8 @@ tests="
> >  	list_flush_ipv4_exception	ipv4: list and flush cached exceptions	1
> >  	list_flush_ipv6_exception	ipv6: list and flush cached exceptions	1
> >  	pmtu_ipv4_route_change		ipv4: PMTU exception w/route replace	1
> > -	pmtu_ipv6_route_change		ipv6: PMTU exception w/route replace	1"
> > +	pmtu_ipv6_route_change		ipv6: PMTU exception w/route replace	1
> > +	pmtu_ipv4_mp_exceptions		ipv4: PMTU multipath nh exceptions		0"
>
> There is misalignment in the last field ("re-run with nh") and I think
> it should be "1" in order to exercise this path with both legacy
> nexthops and nexthop objects. See more below.
>
> In addition, this test case is missing documentation at the beginning of
> the file.
>
> I also believe we have a problem with IPv6, but I will look into it when
> I have time and add a test case.
>
> >  
> >  # Addressing and routing for tests with routers: four network segments, with
> >  # index SEGMENT between 1 and 4, a common prefix (PREFIX4 or PREFIX6) and an
> > @@ -2329,6 +2330,81 @@ test_pmtu_ipv6_route_change() {
> >  	test_pmtu_ipvX_route_change 6
> >  }
> >  
> > +test_pmtu_ipv4_mp_exceptions() {
> > +	setup namespaces routing || return $ksft_skip
> > +
> > +	ip nexthop ls >/dev/null 2>&1
> > +	if [ $? -ne 0 ]; then
> > +		echo "Nexthop objects not supported; skipping tests"
> > +		exit $ksft_skip
> > +	fi
>
> Whether nexthop objects are supported or not is stored in "HAVE_NH" and
> there is no need for this check here if you add setup_mp() like I
> suggest below.
>
> > +
> > +	trace "${ns_a}"  veth_A-R1    "${ns_r1}" veth_R1-A \
> > +	      "${ns_r1}" veth_R1-B    "${ns_b}"  veth_B-R1 \
> > +	      "${ns_a}"  veth_A-R2    "${ns_r2}" veth_R2-A \
> > +	      "${ns_r2}" veth_R2-B    "${ns_b}"  veth_B-R2
> > +
> > +	dummy0_a="192.168.99.99"
> > +	dummy0_b="192.168.88.88"
> > +
> > +	# Set up initial MTU values
> > +	mtu "${ns_a}"  veth_A-R1 2000
> > +	mtu "${ns_r1}" veth_R1-A 2000
> > +	mtu "${ns_r1}" veth_R1-B 1500
> > +	mtu "${ns_b}"  veth_B-R1 1500
>
> According to the documentation at the beginning of the file, the MTU
> between B and R1 is 1400. See other test cases.
>
> > +
> > +	mtu "${ns_a}"  veth_A-R2 2000
> > +	mtu "${ns_r2}" veth_R2-A 2000
> > +	mtu "${ns_r2}" veth_R2-B 1500
> > +	mtu "${ns_b}"  veth_B-R2 1500
> > +
> > +	fail=0
> > +
> > +	#Set up host A with multipath routes to host B dummy0_b
>          ^ Missing space after '#'. Same in other places.
>
> > +	run_cmd ${ns_a} sysctl -q net.ipv4.fib_multipath_hash_policy=1
>
> The test only uses ICMP, so hash policy can be L3 (0). Same for B.
>
> > +	run_cmd ${ns_a} sysctl -q net.ipv4.ip_forward=1
>
> It's a host, so no need to enable forwarding.
>
> > +	run_cmd ${ns_a} ip link add dummy0 mtu 2000 type dummy
> > +	run_cmd ${ns_a} ip link set dummy0 up
>
> Can't you simply add the address on the loopback device instead of
> creating a dummy netdev?
>
> > +	run_cmd ${ns_a} ip addr add ${dummy0_a} dev dummy0
> > +	run_cmd ${ns_a} ip nexthop add id 201 via ${prefix4}.${a_r1}.2 dev veth_A-R1
> > +	run_cmd ${ns_a} ip nexthop add id 202 via ${prefix4}.${a_r2}.2 dev veth_A-R2
> > +	run_cmd ${ns_a} ip nexthop add id 203 group 201/202
> > +	run_cmd ${ns_a} ip route add ${dummy0_b} nhid 203
>
> It would be good to test this with legacy nexthops as well. I'm not that
> familiar with pmtu.sh, but I think you can add a "setup_mp" function and
> invoke it via "setup namespaces routing mp [...]" at the beginning of
> the function.
>
> Internally, setup_mp() will consult "USE_NH" and invoke setup_mp_old()
> or setup_mp_new() accordingly, like setup_routing() does.
>
> > +
> > +	#Set up host B with multipath routes to host A dummy0_a
> > +	run_cmd ${ns_b} sysctl -q net.ipv4.fib_multipath_hash_policy=1
> > +	run_cmd ${ns_b} sysctl -q net.ipv4.ip_forward=1
> > +	run_cmd ${ns_b} ip link add dummy0 mtu 2000 type dummy
> > +	run_cmd ${ns_b} ip link set dummy0 up
> > +	run_cmd ${ns_b} ip addr add ${dummy0_b} dev dummy0
> > +	run_cmd ${ns_b} ip nexthop add id 201 via ${prefix4}.${b_r1}.2 dev veth_A-R1
> > +	run_cmd ${ns_b} ip nexthop add id 202 via ${prefix4}.${b_r2}.2 dev veth_A-R2
> > +	run_cmd ${ns_b} ip nexthop add id 203 group 201/202
> > +	run_cmd ${ns_b} ip route add ${dummy0_a} nhid 203
> > +
> > +	#Set up routers with routes to dummies
> > +	run_cmd ${ns_r1} ip route add ${dummy0_a} via ${prefix4}.${a_r1}.1
> > +	run_cmd ${ns_r2} ip route add ${dummy0_a} via ${prefix4}.${a_r2}.1
> > +	run_cmd ${ns_r1} ip route add ${dummy0_b} via ${prefix4}.${b_r1}.1
> > +	run_cmd ${ns_r2} ip route add ${dummy0_b} via ${prefix4}.${b_r2}.1
> > +
> > +
> > +	#Ping and expect two nexthop exceptions for two routes in nh group
> > +	run_cmd ${ns_a} ping -q -M want -i 0.1 -c 2 -s 1800 "${dummy0_b}"
>
> '-c 1' should be enough
>
> > +
> > +	#Do route lookup before checking cached exceptions
>
> IIUC, previous command created an exception in both nexthops. The
> following commands are needed so that a dst entry will be cached in both
> exceptions and therefore be dumped below to user space. If so, please
> adjust the comment to reflect that.
>
> > +	run_cmd ${ns_a} ip route get ${dummy0_b} oif veth_A-R1
> > +	run_cmd ${ns_a} ip route get ${dummy0_b} oif veth_A-R2
> > +
> > +	#Check cached exceptions
> > +	if [ "$(${ns_a} ip -oneline route list cache| grep mtu | wc -l)" -ne 2 ]; then
>                                                    ^ Missing space before the pipe
>
>
> > +		err "  there are not enough cached exceptions"
> > +		fail=1
> > +	fi
> > +
> > +	return ${fail}
> > +}
> > +
> >  usage() {
> >  	echo
> >  	echo "$0 [OPTIONS] [TEST]..."
> > 
> > base-commit: 66600fac7a984dea4ae095411f644770b2561ede
> > -- 
> > 2.43.5
> >