On Fri, Dec 06, 2019 at 05:41:01AM -0800, Eric Dumazet wrote:
> 
> 
> On 12/6/19 4:17 AM, Thadeu Lima de Souza Cascardo wrote:
> > On Wed, Dec 04, 2019 at 12:03:57PM -0800, Eric Dumazet wrote:
> >>
> >>
> >> On 12/4/19 11:53 AM, Thadeu Lima de Souza Cascardo wrote:
> >>> When using fragments of size 8 and a payload larger than 8000, the backlog
> >>> might fill up and packets will be dropped, causing the test to fail. This
> >>> happens often enough when conntrack is on during the IPv6 test.
> >>>
> >>> As the largest payload in the test is 10000, using a backlog of 1250 allows
> >>> the test to run repeatedly without failure. At least 1000 runs were
> >>> possible with no failures, when usually fewer than 50 runs were enough to
> >>> show a failure.
> >>>
> >>> As netdev_max_backlog is not a per-netns setting, this sets the backlog
> >>> back to 1000 during exit, to avoid disturbing the tests that follow.
> >>>
> >>
> >> Hmmm... I would prefer not changing a global setting like that.
> >> This is going to be flaky, since we often run tests in parallel (using different netns).
> >>
> >> What about adding a small delay after each sent packet?
> >>
> >> diff --git a/tools/testing/selftests/net/ip_defrag.c b/tools/testing/selftests/net/ip_defrag.c
> >> index c0c9ecb891e1d78585e0db95fd8783be31bc563a..24d0723d2e7e9b94c3e365ee2ee30e9445deafa8 100644
> >> --- a/tools/testing/selftests/net/ip_defrag.c
> >> +++ b/tools/testing/selftests/net/ip_defrag.c
> >> @@ -198,6 +198,7 @@ static void send_fragment(int fd_raw, struct sockaddr *addr, socklen_t alen,
> >>  		error(1, 0, "send_fragment: %d vs %d", res, frag_len);
> >>  
> >>  	frag_counter++;
> >> +	usleep(1000);
> >>  }
> >>  
> >>  static void send_udp_frags(int fd_raw, struct sockaddr *addr,
> >>
> > 
> > That won't work, because the issue only shows up when we are using conntrack:
> > the packet is reassembled on output and then fragmented again. When this
> > happens, the fragmentation code transmits the fragments in a tight loop,
> > which floods the backlog.
> 
> Interesting!
> 
> So it looks like the test is correct, and exposed a long-standing problem in this code.
> 
> We should not adjust the test to some kernel-of-the-day constraints, and instead fix the kernel bug ;)
> 
> Where is this tight loop exactly?
> 
> If this is feeding/bursting ~1000 skbs via netif_rx() in a BH context, maybe we need to call a variant
> that allows immediate processing instead of (ab)using the softnet backlog.
> 
> Thanks!

This is the loopback interface, so its xmit calls netif_rx. I suppose we would
have the same problem with veth, for example.

net/ipv6/ip6_output.c:ip6_fragment has this:

	for (;;) {
		/* Prepare header of the next frame,
		 * before previous one went down.
		 */
		if (iter.frag)
			ip6_fraglist_prepare(skb, &iter);

		skb->tstamp = tstamp;
		err = output(net, sk, skb);
		if (!err)
			IP6_INC_STATS(net, ip6_dst_idev(&rt->dst),
				      IPSTATS_MIB_FRAGCREATES);

		if (err || !iter.frag)
			break;

		skb = ip6_fraglist_next(&iter);
	}

Here, output is ip6_finish_output2, which calls neigh_output and ends up
calling dev_queue_xmit.

In this case, ip6_fragment is probably being called from rawv6_send_hdrinc ->
dst_output -> ip6_output -> ip6_finish_output -> __ip6_finish_output ->
ip6_fragment.

dst_output in rawv6_send_hdrinc is called after the netfilter NF_INET_LOCAL_OUT
hook. That hook gathers the fragments and only accepts the last, reassembled
skb, which is what causes ip6_fragment to enter that loop.
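By the way, a quick way I have of confirming that the drops really come from
the softnet backlog while the test runs is to watch the per-CPU "dropped"
counters in /proc/net/softnet_stat, together with the current
netdev_max_backlog. A minimal sketch of that check (my own rough diagnostic,
not part of any fix; it assumes the usual layout where the second hex column
of softnet_stat is the backlog drop count, which is not a stable ABI):

/* softnet_drops.c - rough diagnostic only.
 *
 * Prints net.core.netdev_max_backlog and the sum of the per-CPU "dropped"
 * counters from /proc/net/softnet_stat.  Assumes the second hex column is
 * the backlog drop count.
 *
 * Build: gcc -Wall -o softnet_drops softnet_drops.c
 */
#include <stdio.h>

int main(void)
{
	FILE *f;
	char line[512];
	unsigned long long total_dropped = 0;
	int backlog = -1;

	f = fopen("/proc/sys/net/core/netdev_max_backlog", "r");
	if (f) {
		if (fscanf(f, "%d", &backlog) != 1)
			backlog = -1;
		fclose(f);
	}
	printf("netdev_max_backlog = %d\n", backlog);

	f = fopen("/proc/net/softnet_stat", "r");
	if (!f) {
		perror("/proc/net/softnet_stat");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		unsigned long long processed, dropped;

		/* One line per CPU; fields are hex: processed dropped ... */
		if (sscanf(line, "%llx %llx", &processed, &dropped) == 2)
			total_dropped += dropped;
	}
	fclose(f);

	printf("backlog drops (sum over CPUs) = %llu\n", total_dropped);
	return 0;
}

Running it before and after a test iteration and comparing the totals shows
the counter jumping exactly when the IPv6-with-conntrack case fails.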
So, basically, the easiest way to reproduce this is to run this test over
loopback with netfilter doing the reassembly during conntrack. I see some BH
locks here and there, but I think this is just filling up the backlog too fast
for softirq to get a chance to kick in.

I will see if I can reproduce this using routed veths.

Cascardo.