Subject prefix should be "[PATCH net-next v6]" (v7 for next version, obviously) On Sat, Nov 02, 2024 at 04:29:35PM +0000, Vladimir Vdovin wrote: > Check number of paths by fib_info_num_path(), > and update_or_create_fnhe() for every path. > Problem is that pmtu is cached only for the oif > that has received icmp message "need to frag", > other oifs will still try to use "default" iface mtu. > > An example topology showing the problem: > > | host1 > +---------+ > | dummy0 | 10.179.20.18/32 mtu9000 > +---------+ > +-----------+----------------+ > +---------+ +---------+ > | ens17f0 | 10.179.2.141/31 | ens17f1 | 10.179.2.13/31 > +---------+ +---------+ > | (all here have mtu 9000) | > +------+ +------+ > | ro1 | 10.179.2.140/31 | ro2 | 10.179.2.12/31 > +------+ +------+ > | | > ---------+------------+-------------------+------ > | > +-----+ > | ro3 | 10.10.10.10 mtu1500 > +-----+ > | > ======================================== > some networks > ======================================== > | > +-----+ > | eth0| 10.10.30.30 mtu9000 > +-----+ > | host2 > > host1 have enabled multipath and > sysctl net.ipv4.fib_multipath_hash_policy = 1: > > default proto static src 10.179.20.18 > nexthop via 10.179.2.12 dev ens17f1 weight 1 > nexthop via 10.179.2.140 dev ens17f0 weight 1 > > When host1 tries to do pmtud from 10.179.20.18/32 to host2, > host1 receives at ens17f1 iface an icmp packet from ro3 that ro3 mtu=1500. > And host1 caches it in nexthop exceptions cache. > > Problem is that it is cached only for the iface that has received icmp, > and there is no way that ro3 will send icmp msg to host1 via another path. > > Host1 now have this routes to host2: > > ip r g 10.10.30.30 sport 30000 dport 443 > 10.10.30.30 via 10.179.2.12 dev ens17f1 src 10.179.20.18 uid 0 > cache expires 521sec mtu 1500 > > ip r g 10.10.30.30 sport 30033 dport 443 > 10.10.30.30 via 10.179.2.140 dev ens17f0 src 10.179.20.18 uid 0 > cache > > So when host1 tries again to reach host2 with mtu>1500, > if packet flow is lucky enough to be hashed with oif=ens17f1 its ok, > if oif=ens17f0 it blackholes and still gets icmp msgs from ro3 to ens17f1, > until lucky day when ro3 will send it through another flow to ens17f0. > > Signed-off-by: Vladimir Vdovin <deliran@xxxxxxxxxx> > --- > > V6: > - make commit message cleaner > > V5: > - make self test cleaner > > V4: > - fix selftest, do route lookup before checking cached exceptions > > V3: > - added selftest > - fixed compile error > > V2: > - fix fib_info_num_path parameter pass > --- > net/ipv4/route.c | 13 +++++ > tools/testing/selftests/net/pmtu.sh | 78 ++++++++++++++++++++++++++++- > 2 files changed, 90 insertions(+), 1 deletion(-) > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c > index 723ac9181558..41162b5cc4cb 100644 > --- a/net/ipv4/route.c > +++ b/net/ipv4/route.c > @@ -1027,6 +1027,19 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu) > struct fib_nh_common *nhc; > > fib_select_path(net, &res, fl4, NULL); > +#ifdef CONFIG_IP_ROUTE_MULTIPATH > + if (fib_info_num_path(res.fi) > 1) { > + int nhsel; > + > + for (nhsel = 0; nhsel < fib_info_num_path(res.fi); nhsel++) { > + nhc = fib_info_nhc(res.fi, nhsel); > + update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock, > + jiffies + net->ipv4.ip_rt_mtu_expires); > + } > + rcu_read_unlock(); > + return; > + } > +#endif /* CONFIG_IP_ROUTE_MULTIPATH */ > nhc = FIB_RES_NHC(res); > update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock, > jiffies + net->ipv4.ip_rt_mtu_expires); > diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh > index 569bce8b6383..a0159340fe84 100755 > --- a/tools/testing/selftests/net/pmtu.sh > +++ b/tools/testing/selftests/net/pmtu.sh > @@ -266,7 +266,8 @@ tests=" > list_flush_ipv4_exception ipv4: list and flush cached exceptions 1 > list_flush_ipv6_exception ipv6: list and flush cached exceptions 1 > pmtu_ipv4_route_change ipv4: PMTU exception w/route replace 1 > - pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1" > + pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1 > + pmtu_ipv4_mp_exceptions ipv4: PMTU multipath nh exceptions 0" There is misalignment in the last field ("re-run with nh") and I think it should be "1" in order to exercise this path with both legacy nexthops and nexthop objects. See more below. In addition, this test case is missing documentation at the beginning of the file. I also believe we have a problem with IPv6, but I will look into it when I have time and add a test case. > > # Addressing and routing for tests with routers: four network segments, with > # index SEGMENT between 1 and 4, a common prefix (PREFIX4 or PREFIX6) and an > @@ -2329,6 +2330,81 @@ test_pmtu_ipv6_route_change() { > test_pmtu_ipvX_route_change 6 > } > > +test_pmtu_ipv4_mp_exceptions() { > + setup namespaces routing || return $ksft_skip > + > + ip nexthop ls >/dev/null 2>&1 > + if [ $? -ne 0 ]; then > + echo "Nexthop objects not supported; skipping tests" > + exit $ksft_skip > + fi Whether nexthop objects are supported or not is stored in "HAVE_NH" and there is no need for this check here if you add setup_mp() like I suggest below. > + > + trace "${ns_a}" veth_A-R1 "${ns_r1}" veth_R1-A \ > + "${ns_r1}" veth_R1-B "${ns_b}" veth_B-R1 \ > + "${ns_a}" veth_A-R2 "${ns_r2}" veth_R2-A \ > + "${ns_r2}" veth_R2-B "${ns_b}" veth_B-R2 > + > + dummy0_a="192.168.99.99" > + dummy0_b="192.168.88.88" > + > + # Set up initial MTU values > + mtu "${ns_a}" veth_A-R1 2000 > + mtu "${ns_r1}" veth_R1-A 2000 > + mtu "${ns_r1}" veth_R1-B 1500 > + mtu "${ns_b}" veth_B-R1 1500 According to the documentation at the beginning of the file, the MTU between B and R1 is 1400. See other test cases. > + > + mtu "${ns_a}" veth_A-R2 2000 > + mtu "${ns_r2}" veth_R2-A 2000 > + mtu "${ns_r2}" veth_R2-B 1500 > + mtu "${ns_b}" veth_B-R2 1500 > + > + fail=0 > + > + #Set up host A with multipath routes to host B dummy0_b ^ Missing space after '#'. Same in other places. > + run_cmd ${ns_a} sysctl -q net.ipv4.fib_multipath_hash_policy=1 The test only uses ICMP, so hash policy can be L3 (0). Same for B. > + run_cmd ${ns_a} sysctl -q net.ipv4.ip_forward=1 It's a host, so no need to enable forwarding. > + run_cmd ${ns_a} ip link add dummy0 mtu 2000 type dummy > + run_cmd ${ns_a} ip link set dummy0 up Can't you simply add the address on the loopback device instead of creating a dummy netdev? > + run_cmd ${ns_a} ip addr add ${dummy0_a} dev dummy0 > + run_cmd ${ns_a} ip nexthop add id 201 via ${prefix4}.${a_r1}.2 dev veth_A-R1 > + run_cmd ${ns_a} ip nexthop add id 202 via ${prefix4}.${a_r2}.2 dev veth_A-R2 > + run_cmd ${ns_a} ip nexthop add id 203 group 201/202 > + run_cmd ${ns_a} ip route add ${dummy0_b} nhid 203 It would be good to test this with legacy nexthops as well. I'm not that familiar with pmtu.sh, but I think you can add a "setup_mp" function and invoke it via "setup namespaces routing mp [...]" at the beginning of the function. Internally, setup_mp() will consult "USE_NH" and invoke setup_mp_old() or setup_mp_new() accordingly, like setup_routing() does. > + > + #Set up host B with multipath routes to host A dummy0_a > + run_cmd ${ns_b} sysctl -q net.ipv4.fib_multipath_hash_policy=1 > + run_cmd ${ns_b} sysctl -q net.ipv4.ip_forward=1 > + run_cmd ${ns_b} ip link add dummy0 mtu 2000 type dummy > + run_cmd ${ns_b} ip link set dummy0 up > + run_cmd ${ns_b} ip addr add ${dummy0_b} dev dummy0 > + run_cmd ${ns_b} ip nexthop add id 201 via ${prefix4}.${b_r1}.2 dev veth_A-R1 > + run_cmd ${ns_b} ip nexthop add id 202 via ${prefix4}.${b_r2}.2 dev veth_A-R2 > + run_cmd ${ns_b} ip nexthop add id 203 group 201/202 > + run_cmd ${ns_b} ip route add ${dummy0_a} nhid 203 > + > + #Set up routers with routes to dummies > + run_cmd ${ns_r1} ip route add ${dummy0_a} via ${prefix4}.${a_r1}.1 > + run_cmd ${ns_r2} ip route add ${dummy0_a} via ${prefix4}.${a_r2}.1 > + run_cmd ${ns_r1} ip route add ${dummy0_b} via ${prefix4}.${b_r1}.1 > + run_cmd ${ns_r2} ip route add ${dummy0_b} via ${prefix4}.${b_r2}.1 > + > + > + #Ping and expect two nexthop exceptions for two routes in nh group > + run_cmd ${ns_a} ping -q -M want -i 0.1 -c 2 -s 1800 "${dummy0_b}" '-c 1' should be enough > + > + #Do route lookup before checking cached exceptions IIUC, previous command created an exception in both nexthops. The following commands are needed so that a dst entry will be cached in both exceptions and therefore be dumped below to user space. If so, please adjust the comment to reflect that. > + run_cmd ${ns_a} ip route get ${dummy0_b} oif veth_A-R1 > + run_cmd ${ns_a} ip route get ${dummy0_b} oif veth_A-R2 > + > + #Check cached exceptions > + if [ "$(${ns_a} ip -oneline route list cache| grep mtu | wc -l)" -ne 2 ]; then ^ Missing space before the pipe > + err " there are not enough cached exceptions" > + fail=1 > + fi > + > + return ${fail} > +} > + > usage() { > echo > echo "$0 [OPTIONS] [TEST]..." > > base-commit: 66600fac7a984dea4ae095411f644770b2561ede > -- > 2.43.5 >