> On 19/04/2023 16.21, Lorenzo Bianconi wrote:
> > >
> > > On 19/04/2023 14.09, Eric Dumazet wrote:
> > > > On Wed, Apr 19, 2023 at 1:08 PM Jesper Dangaard Brouer wrote:
> > > > >
> > > > > On 18/04/2023 09.36, Lorenzo Bianconi wrote:
> > > > > > > On Mon, 17 Apr 2023 23:31:01 +0200 Lorenzo Bianconi wrote:
> > > > > > > > > If it's that then I'm with Eric. There are many ways to keep the pages
> > > > > > > > > in use, no point working around one of them and not the rest :(
> > > > > > > >
> > > > > > > > I was not clear here, my fault. What I mean is I can see the returned
> > > > > > > > pages counter increasing from time to time, but during most of the tests,
> > > > > > > > even 2h after the tcp traffic has stopped, page_pool_release_retry()
> > > > > > > > still complains that not all the pages are returned to the pool and so the
> > > > > > > > pool has not been deallocated yet.
> > > > > > > > The chunk of code in my first email is just to demonstrate the issue
> > > > > > > > and I am completely fine to get a better solution :)
> > > > > > >
> > > > > > > Your problem is perhaps made worse by threaded NAPI, you have
> > > > > > > defer-free skbs sprayed across all cores and no NAPI there to
> > > > > > > flush them :(
> > > > > >
> > > > > > yes, exactly :)
> > > > > >
> > > > > > > > I guess we just need a way to free the pool in a reasonable amount
> > > > > > > > of time. Agree?
> > > > > > >
> > > > > > > Whether we need to guarantee the release is the real question.
> > > > > >
> > > > > > yes, this is the main goal of my email. The defer-free skbs behaviour seems in
> > > > > > contrast with the page_pool pending pages monitor mechanism or at least they
> > > > > > do not work well together.
> > > > > >
> > > > > > @Jesper, Ilias: any input on it?
> > > > > >
> > > > > > > Maybe it's more of a false-positive warning.
> > > > > > >
> > > > > > > Flushing the defer list is probably fine as a hack, but it's not
> > > > > > > a full fix as Eric explained. False positives can still happen.
> > > > > >
> > > > > > agree, it was just a way to give an idea of the issue, not a proper solution.
> > > > > >
> > > > > > Regards,
> > > > > > Lorenzo
> > > > > >
> > > > > > >
> > > > > > > I'm ambivalent. My only real request would be to make the flushing
> > > > > > > a helper in net/core/dev.c rather than open coded in page_pool.c.
> > > > >
> > > > > I agree. We need a central defer_list flushing helper.
> > > > >
> > > > > It is too easy to say this is a false-positive warning.
> > > > > IMHO this exposes an issue with the sd->defer_list system.
> > > > >
> > > > > Lorenzo's test is adding+removing veth devices, which creates and runs
> > > > > NAPI processing on random CPUs. After veth netdevices (+NAPI) are
> > > > > removed, nothing will naturally invoke net_rx_softirq on this CPU.
> > > > > Thus, we have SKBs waiting on the CPU's sd->defer_list. Furthermore, we will
> > > > > not create new SKBs with this skb->alloc_cpu, to trigger the RX softirq IPI
> > > > > call (trigger_rx_softirq), even if this CPU processes and frees SKBs.
> > > > >
> > > > > I see two solutions:
> > > > >
> > > > > (1) When netdevice/NAPI unregister happens call defer_list flushing
> > > > > helper.
> > > > >
> > > > > (2) Use napi_watchdog to detect if defer_list is (many jiffies) old,
> > > > > and then call defer_list flushing helper.
> > > > >
> > > > > > >
> > > > > > > Somewhat related - Eric, do we need to handle defer_list in dev_cpu_dead()?
> > > > >
> > > > > Looks to me like dev_cpu_dead() also needs this flushing helper for
> > > > > sd->defer_list, or at least moving the sd->defer_list to an sd that will
> > > > > run eventually.
> > > >
> > > > I think I just considered having a few skbs in per-cpu list would not
> > > > be an issue,
> > > > especially considering skbs can sit hours in tcp receive queues.
> > > >
> > >
> > > It was the first thing I said to Lorenzo when he first reported the
> > > problem to me (over chat): It is likely packets sitting in a TCP queue.
> > > Then I instructed him to look at output from netstat to see queues and
> > > look for TIME-WAIT, FIN-WAIT etc.
> > >
> > >
> > > > Do we expect having some kind of callback/shrinker to instruct TCP or
> > > > pipes to release all pages that prevent
> > > > a page_pool from being freed?
> > > >
> > >
> > > This is *not* what I'm asking for.
> > >
> > > With TCP sockets (pipes etc) we can take care of closing the sockets
> > > (and programs etc) to free up the SKBs (and perhaps wait for timeouts)
> > > to make sure the page_pool shutdown doesn't hang.
> > >
> > > The problem arises for all the selftests that use veth and bpf_test_run
> > > (using bpf_test_run_xdp_live / xdp_test_run_setup). For the selftests
> > > we obviously take care of closing sockets and removing veth interfaces
> > > again. Problem: The defer_list corner-case isn't under our control.
> > >
> > >
> > > > Here, we are talking of hundreds of thousands of skbs, compared to at
> > > > most 32 skbs per cpu.
> > > >
> > >
> > > It is not a memory usage concern.
> > >
> > > > Perhaps set sysctl_skb_defer_max to zero by default, so that admins
> > > > can opt-in
> > > >
> > >
> > > I really like the sd->defer_list system and I think it should be enabled
> > > by default. Even if disabled by default, we still need to handle these
> > > corner cases, as the selftests shouldn't start to cause issues when this
> > > gets enabled.
> > >
> > > The simple solution is: (1) When netdevice/NAPI unregister happens call
> > > defer_list flushing helper. And perhaps we also need to call it in
> > > xdp_test_run_teardown(). How do you feel about that?
> > >
> > > --Jesper
> > >
> >
> > Today I was discussing with Toke about this issue, and we were wondering,
> > if we just consider the page_pool use-case, what about moving the real pool
> > destroying steps to when we return a page to the pool in page_pool_put_full_page(),
> > if the pool has been marked to be destroyed and there are no inflight pages, instead
> > of assuming we have all the pages in the pool when we run page_pool_destroy()?
>
> It sounds like you want to add a runtime check to the fast-path to
> handle these corner cases?
>
> For performance reasons we should not call the page_pool_inflight() check in
> the fast-path, please!

ack, right.

>
> Details: You hopefully mean running/calling page_pool_release(pool) and not
> page_pool_destroy().

yes, I mean page_pool_release()

>
> I'm not totally against the idea, as long as someone is willing to do
> extensive benchmarking that it doesn't affect fast-path performance.
> Given we already read pool->p.flags in fast-path, it might be possible
> to hide the extra branch (in the CPU pipeline).
>
> >
> > Maybe this means just get rid of the warn in page_pool_release_retry() :)
> >
>
> Sure, we can remove the print statement, but it feels like closing our
> eyes and ignoring the problem. We can remove the print statement, and
> still debug the problem, as I have added tracepoints (to debug this).
> But users will not report these issues early... on the other hand most of
> these reports will likely be false-positives.
>
> This reminds me that Jakub's recent defer patches returning pages
> 'directly' to the page_pool alloc-cache, will actually result in this
> kind of bug. This is because page_pool_destroy() assumes that pages
> cannot be returned to alloc-cache, as driver will have "disconnected" RX
> side. We need to address this bug separately. Lorenzo, you didn't
> happen to use a kernel with Jakub's patches included, did you?

nope, I did not test them.

Regards,
Lorenzo

>
> --Jesper
>
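
The idea discussed above (defer the real release until the last in-flight page comes back, while keeping the page-return path cheap) can be sketched as a small userspace model. This is only an illustration under stated assumptions, not kernel code: `struct toy_pool` and every `toy_pool_*` function are hypothetical names, and the single `inflight` counter is a simplification of how page_pool actually tracks outstanding pages. The return path tests one flag first and only looks at the in-flight count once teardown has been requested, in the spirit of the "hide the extra branch" remark.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; NOT the kernel's struct page_pool. */
struct toy_pool {
	atomic_int  inflight;   /* pages handed out and not yet returned */
	atomic_bool destroying; /* owner has requested teardown */
	atomic_bool released;   /* final release has been performed */
};

void toy_pool_init(struct toy_pool *p)
{
	atomic_init(&p->inflight, 0);
	atomic_init(&p->destroying, false);
	atomic_init(&p->released, false);
}

/*
 * Release only if teardown was requested and nothing is in flight.
 * Returns true for the single caller that performs the release.
 */
bool toy_pool_try_release(struct toy_pool *p)
{
	if (!atomic_load(&p->destroying) || atomic_load(&p->inflight) != 0)
		return false;
	/* atomic_exchange returns the old value: only the first caller sees false. */
	return !atomic_exchange(&p->released, true);
}

/* Hand out one page (model only: just bumps the counter). */
void toy_pool_get_page(struct toy_pool *p)
{
	atomic_fetch_add(&p->inflight, 1);
}

/*
 * Return one page. While the pool is live this costs one extra flag test;
 * the in-flight check runs only after teardown has been requested.
 * Returns true if this call performed the final release.
 */
bool toy_pool_put_page(struct toy_pool *p)
{
	int prev = atomic_fetch_sub(&p->inflight, 1);

	assert(prev > 0); /* returning more pages than were handed out is a bug */
	(void)prev;
	if (!atomic_load(&p->destroying))
		return false; /* common case: pool is live */
	return toy_pool_try_release(p);
}

/* Request teardown; releases immediately only if no page is in flight. */
bool toy_pool_destroy(struct toy_pool *p)
{
	atomic_store(&p->destroying, true);
	return toy_pool_try_release(p);
}
```

In this model, a destroy request issued while pages are still outstanding returns false, and the release happens later from whichever `toy_pool_put_page()` call brings the count to zero. Whether an equivalent branch is acceptable in the real fast path is exactly the benchmarking question raised in the thread.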