Many thanks for your comments! Sent new patch with proper fixes On Sun Nov 3, 2024 at 4:05 PM MSK, Ido Schimmel wrote: > Subject prefix should be "[PATCH net-next v6]" (v7 for next version, > obviously) > > On Sat, Nov 02, 2024 at 04:29:35PM +0000, Vladimir Vdovin wrote: > > Check number of paths by fib_info_num_path(), > > and update_or_create_fnhe() for every path. > > Problem is that pmtu is cached only for the oif > > that has received icmp message "need to frag", > > other oifs will still try to use "default" iface mtu. > > > > An example topology showing the problem: > > > > | host1 > > +---------+ > > | dummy0 | 10.179.20.18/32 mtu9000 > > +---------+ > > +-----------+----------------+ > > +---------+ +---------+ > > | ens17f0 | 10.179.2.141/31 | ens17f1 | 10.179.2.13/31 > > +---------+ +---------+ > > | (all here have mtu 9000) | > > +------+ +------+ > > | ro1 | 10.179.2.140/31 | ro2 | 10.179.2.12/31 > > +------+ +------+ > > | | > > ---------+------------+-------------------+------ > > | > > +-----+ > > | ro3 | 10.10.10.10 mtu1500 > > +-----+ > > | > > ======================================== > > some networks > > ======================================== > > | > > +-----+ > > | eth0| 10.10.30.30 mtu9000 > > +-----+ > > | host2 > > > > host1 have enabled multipath and > > sysctl net.ipv4.fib_multipath_hash_policy = 1: > > > > default proto static src 10.179.20.18 > > nexthop via 10.179.2.12 dev ens17f1 weight 1 > > nexthop via 10.179.2.140 dev ens17f0 weight 1 > > > > When host1 tries to do pmtud from 10.179.20.18/32 to host2, > > host1 receives at ens17f1 iface an icmp packet from ro3 that ro3 mtu=1500. > > And host1 caches it in nexthop exceptions cache. > > > > Problem is that it is cached only for the iface that has received icmp, > > and there is no way that ro3 will send icmp msg to host1 via another path. > > > > Host1 now have this routes to host2: > > > > ip r g 10.10.30.30 sport 30000 dport 443 > > 10.10.30.30 via 10.179.2.12 dev ens17f1 src 10.179.20.18 uid 0 > > cache expires 521sec mtu 1500 > > > > ip r g 10.10.30.30 sport 30033 dport 443 > > 10.10.30.30 via 10.179.2.140 dev ens17f0 src 10.179.20.18 uid 0 > > cache > > > > So when host1 tries again to reach host2 with mtu>1500, > > if packet flow is lucky enough to be hashed with oif=ens17f1 its ok, > > if oif=ens17f0 it blackholes and still gets icmp msgs from ro3 to ens17f1, > > until lucky day when ro3 will send it through another flow to ens17f0. > > > > Signed-off-by: Vladimir Vdovin <deliran@xxxxxxxxxx> > > --- > > > > V6: > > - make commit message cleaner > > > > V5: > > - make self test cleaner > > > > V4: > > - fix selftest, do route lookup before checking cached exceptions > > > > V3: > > - added selftest > > - fixed compile error > > > > V2: > > - fix fib_info_num_path parameter pass > > --- > > net/ipv4/route.c | 13 +++++ > > tools/testing/selftests/net/pmtu.sh | 78 ++++++++++++++++++++++++++++- > > 2 files changed, 90 insertions(+), 1 deletion(-) > > > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c > > index 723ac9181558..41162b5cc4cb 100644 > > --- a/net/ipv4/route.c > > +++ b/net/ipv4/route.c > > @@ -1027,6 +1027,19 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu) > > struct fib_nh_common *nhc; > > > > fib_select_path(net, &res, fl4, NULL); > > +#ifdef CONFIG_IP_ROUTE_MULTIPATH > > + if (fib_info_num_path(res.fi) > 1) { > > + int nhsel; > > + > > + for (nhsel = 0; nhsel < fib_info_num_path(res.fi); nhsel++) { > > + nhc = fib_info_nhc(res.fi, nhsel); > > + update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock, > > + jiffies + net->ipv4.ip_rt_mtu_expires); > > + } > > + rcu_read_unlock(); > > + return; > > + } > > +#endif /* CONFIG_IP_ROUTE_MULTIPATH */ > > nhc = FIB_RES_NHC(res); > > update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock, > > jiffies + net->ipv4.ip_rt_mtu_expires); > > diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh > > index 569bce8b6383..a0159340fe84 100755 > > --- a/tools/testing/selftests/net/pmtu.sh > > +++ b/tools/testing/selftests/net/pmtu.sh > > @@ -266,7 +266,8 @@ tests=" > > list_flush_ipv4_exception ipv4: list and flush cached exceptions 1 > > list_flush_ipv6_exception ipv6: list and flush cached exceptions 1 > > pmtu_ipv4_route_change ipv4: PMTU exception w/route replace 1 > > - pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1" > > + pmtu_ipv6_route_change ipv6: PMTU exception w/route replace 1 > > + pmtu_ipv4_mp_exceptions ipv4: PMTU multipath nh exceptions 0" > > There is misalignment in the last field ("re-run with nh") and I think > it should be "1" in order to exercise this path with both legacy > nexthops and nexthop objects. See more below. > > In addition, this test case is missing documentation at the beginning of > the file. > > I also believe we have a problem with IPv6, but I will look into it when > I have time and add a test case. > > > > > # Addressing and routing for tests with routers: four network segments, with > > # index SEGMENT between 1 and 4, a common prefix (PREFIX4 or PREFIX6) and an > > @@ -2329,6 +2330,81 @@ test_pmtu_ipv6_route_change() { > > test_pmtu_ipvX_route_change 6 > > } > > > > +test_pmtu_ipv4_mp_exceptions() { > > + setup namespaces routing || return $ksft_skip > > + > > + ip nexthop ls >/dev/null 2>&1 > > + if [ $? -ne 0 ]; then > > + echo "Nexthop objects not supported; skipping tests" > > + exit $ksft_skip > > + fi > > Whether nexthop objects are supported or not is stored in "HAVE_NH" and > there is no need for this check here if you add setup_mp() like I > suggest below. > > > + > > + trace "${ns_a}" veth_A-R1 "${ns_r1}" veth_R1-A \ > > + "${ns_r1}" veth_R1-B "${ns_b}" veth_B-R1 \ > > + "${ns_a}" veth_A-R2 "${ns_r2}" veth_R2-A \ > > + "${ns_r2}" veth_R2-B "${ns_b}" veth_B-R2 > > + > > + dummy0_a="192.168.99.99" > > + dummy0_b="192.168.88.88" > > + > > + # Set up initial MTU values > > + mtu "${ns_a}" veth_A-R1 2000 > > + mtu "${ns_r1}" veth_R1-A 2000 > > + mtu "${ns_r1}" veth_R1-B 1500 > > + mtu "${ns_b}" veth_B-R1 1500 > > According to the documentation at the beginning of the file, the MTU > between B and R1 is 1400. See other test cases. > > > + > > + mtu "${ns_a}" veth_A-R2 2000 > > + mtu "${ns_r2}" veth_R2-A 2000 > > + mtu "${ns_r2}" veth_R2-B 1500 > > + mtu "${ns_b}" veth_B-R2 1500 > > + > > + fail=0 > > + > > + #Set up host A with multipath routes to host B dummy0_b > ^ Missing space after '#'. Same in other places. > > > + run_cmd ${ns_a} sysctl -q net.ipv4.fib_multipath_hash_policy=1 > > The test only uses ICMP, so hash policy can be L3 (0). Same for B. > > > + run_cmd ${ns_a} sysctl -q net.ipv4.ip_forward=1 > > It's a host, so no need to enable forwarding. > > > + run_cmd ${ns_a} ip link add dummy0 mtu 2000 type dummy > > + run_cmd ${ns_a} ip link set dummy0 up > > Can't you simply add the address on the loopback device instead of > creating a dummy netdev? > > > + run_cmd ${ns_a} ip addr add ${dummy0_a} dev dummy0 > > + run_cmd ${ns_a} ip nexthop add id 201 via ${prefix4}.${a_r1}.2 dev veth_A-R1 > > + run_cmd ${ns_a} ip nexthop add id 202 via ${prefix4}.${a_r2}.2 dev veth_A-R2 > > + run_cmd ${ns_a} ip nexthop add id 203 group 201/202 > > + run_cmd ${ns_a} ip route add ${dummy0_b} nhid 203 > > It would be good to test this with legacy nexthops as well. I'm not that > familiar with pmtu.sh, but I think you can add a "setup_mp" function and > invoke it via "setup namespaces routing mp [...]" at the beginning of > the function. > > Internally, setup_mp() will consult "USE_NH" and invoke setup_mp_old() > or setup_mp_new() accordingly, like setup_routing() does. > > > + > > + #Set up host B with multipath routes to host A dummy0_a > > + run_cmd ${ns_b} sysctl -q net.ipv4.fib_multipath_hash_policy=1 > > + run_cmd ${ns_b} sysctl -q net.ipv4.ip_forward=1 > > + run_cmd ${ns_b} ip link add dummy0 mtu 2000 type dummy > > + run_cmd ${ns_b} ip link set dummy0 up > > + run_cmd ${ns_b} ip addr add ${dummy0_b} dev dummy0 > > + run_cmd ${ns_b} ip nexthop add id 201 via ${prefix4}.${b_r1}.2 dev veth_A-R1 > > + run_cmd ${ns_b} ip nexthop add id 202 via ${prefix4}.${b_r2}.2 dev veth_A-R2 > > + run_cmd ${ns_b} ip nexthop add id 203 group 201/202 > > + run_cmd ${ns_b} ip route add ${dummy0_a} nhid 203 > > + > > + #Set up routers with routes to dummies > > + run_cmd ${ns_r1} ip route add ${dummy0_a} via ${prefix4}.${a_r1}.1 > > + run_cmd ${ns_r2} ip route add ${dummy0_a} via ${prefix4}.${a_r2}.1 > > + run_cmd ${ns_r1} ip route add ${dummy0_b} via ${prefix4}.${b_r1}.1 > > + run_cmd ${ns_r2} ip route add ${dummy0_b} via ${prefix4}.${b_r2}.1 > > + > > + > > + #Ping and expect two nexthop exceptions for two routes in nh group > > + run_cmd ${ns_a} ping -q -M want -i 0.1 -c 2 -s 1800 "${dummy0_b}" > > '-c 1' should be enough > > > + > > + #Do route lookup before checking cached exceptions > > IIUC, previous command created an exception in both nexthops. The > following commands are needed so that a dst entry will be cached in both > exceptions and therefore be dumped below to user space. If so, please > adjust the comment to reflect that. > > > + run_cmd ${ns_a} ip route get ${dummy0_b} oif veth_A-R1 > > + run_cmd ${ns_a} ip route get ${dummy0_b} oif veth_A-R2 > > + > > + #Check cached exceptions > > + if [ "$(${ns_a} ip -oneline route list cache| grep mtu | wc -l)" -ne 2 ]; then > ^ Missing space before the pipe > > > > + err " there are not enough cached exceptions" > > + fail=1 > > + fi > > + > > + return ${fail} > > +} > > + > > usage() { > > echo > > echo "$0 [OPTIONS] [TEST]..." > > > > base-commit: 66600fac7a984dea4ae095411f644770b2561ede > > -- > > 2.43.5 > >