On 2/1/24 23:37, Dmitry Safonov wrote:
> On 2/1/24 22:25, Dmitry Safonov wrote:
>> Hi Jakub,
>>
>> On 2/1/24 21:21, Jakub Kicinski wrote:
>>> On Thu, 1 Feb 2024 00:50:46 +0000 Dmitry Safonov wrote:
>>>> Please, let me know if there will be other issues with tcp-ao tests :)
>>>>
>>>> Going to work on tracepoints and some other TCP-AO stuff for net-next.
>>>
>>> Since you're being nice and helpful I figured I'll try testing TCP-AO
>>> with debug options enabled :) (kernel/configs/debug.config and
>>> kernel/configs/x86_debug.config included),
>>
>> Haha :)
>>
>>> that slows things down
>>> and causes a bit of flakiness in unsigned-md5-* tests:
>>>
>>> https://netdev.bots.linux.dev/flakes.html?br-cnt=75&tn-needle=tcp-ao
>>>
>>> This has links to outputs:
>>> https://netdev.bots.linux.dev/contest.html?executor=vmksft-tcp-ao-dbg&pass=0
>>>
>>> If it's a timing thing - FWIW we started exporting
>>> KSFT_MACHINE_SLOW=yes on the slow runners.
>>
>> I think, I know what happens here:
>>
>> # ok 8 AO server (AO_REQUIRED): AO client: counter TCPAOGood increased 4 => 6
>> # ok 9 AO server (AO_REQUIRED): unsigned client
>> # ok 10 AO server (AO_REQUIRED): unsigned client: counter TCPAORequired increased 1 => 2
>> # not ok 11 AO server (AO_REQUIRED): unsigned client: Counter netns_ao_good was not expected to increase 7 => 8
>>
>> for each of tests the server listens at a new port, but re-uses the same
>> namespaces+veth. If the node/machine is quite slow, I guess a segment
>> might have been retransmitted and the test that initiated it had already
>> finished.
>> And as result, the per-namespace counters are incremented, which makes
>> the test fail (IOW, the test expects all segments in ns being dropped).
>>
>> So, I should do one of the options:
>>
>> 1. relax per-namespace checks (the per-socket and per-key counters are
>>    checked)
>> 2. unshare(net) + veth setup for each test
>> 3. split the selftest on smaller ones (as they create new net-ns in
>>    initialization)
>
> Actually, I think there may be an easier fix:
>
> 4. Make sure that client close()s TCP-AO first, making it twsk.
>    And also make sure that net-ns counters read post server's close().
>
> Will do this, let's see if this fixes the flakiness on the netdev bot :)

FWIW, I ended up with this:
https://lore.kernel.org/all/20240202-unsigned-md5-netns-counters-v1-1-8b90c37c0566@xxxxxxxxxx/

I reproduced the issue once, running unsigned-md5* in a loop, while in
another terminal building linux-next with all cores. With the patch
above, it survived 77 iterations of both ipv4/ipv6 tests so far.
So, there is a chance it fixes the issue :)

Thanks,
             Dmitry
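
For readers following the thread, here is a minimal sketch of the kind of "counter must not increase" check that test 11 fails. It is not the selftest code and not the patch linked above. It assumes the per-namespace counter is taken from /proc/net/netstat-style text and that netns_ao_good corresponds to TCPAOGood there; the function names and the sample snapshots are hypothetical.

```python
def parse_netstat(text):
    """Parse /proc/net/netstat-style text into {counter_name: int}."""
    counters = {}
    lines = text.strip().splitlines()
    # Lines come in pairs: a row of counter names, then a row of values,
    # each starting with the same "Prefix:" token.
    for header, values in zip(lines[0::2], lines[1::2]):
        names = header.split()[1:]
        nums = values.split()[1:]
        for name, num in zip(names, nums):
            counters[name] = int(num)
    return counters


def unexpected_increases(before, after, must_not_grow):
    """Return {name: (old, new)} for counters that grew but should not have."""
    return {
        name: (before.get(name, 0), after.get(name, 0))
        for name in must_not_grow
        if after.get(name, 0) > before.get(name, 0)
    }


# Hypothetical snapshots mirroring the "7 => 8" failure in the quoted log.
BEFORE = """TcpExt: TCPAOGood TCPAORequired
TcpExt: 7 1
"""
AFTER = """TcpExt: TCPAOGood TCPAORequired
TcpExt: 8 2
"""

bad = unexpected_increases(parse_netstat(BEFORE), parse_netstat(AFTER), ["TCPAOGood"])
print(bad)  # {'TCPAOGood': (7, 8)}
```

A check like this is only as reliable as the moment the "after" snapshot is taken: if a late retransmit from an earlier test can still land in the shared namespace, the counter moves even though the current test behaved correctly, which is why the ordering of close() and the counter read matters.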