On Wed, Apr 19, 2023 at 1:08 PM Jesper Dangaard Brouer <jbrouer@xxxxxxxxxx> wrote:
>
>
> On 18/04/2023 09.36, Lorenzo Bianconi wrote:
> >> On Mon, 17 Apr 2023 23:31:01 +0200 Lorenzo Bianconi wrote:
> >>>> If it's that then I'm with Eric. There are many ways to keep the pages
> >>>> in use, no point working around one of them and not the rest :(
> >>>
> >>> I was not clear here, my fault. What I mean is I can see the returned
> >>> pages counter increasing from time to time, but during most of the tests,
> >>> even 2h after the tcp traffic has stopped, page_pool_release_retry()
> >>> still complains that not all the pages have been returned to the pool,
> >>> and so the pool has not been deallocated yet.
> >>> The chunk of code in my first email is just to demonstrate the issue
> >>> and I am completely fine with a better solution :)
> >>
> >> Your problem is perhaps made worse by threaded NAPI, you have
> >> defer-free skbs sprayed across all cores and no NAPI there to
> >> flush them :(
> >
> > yes, exactly :)
> >
> >>
> >>> I guess we just need a way to free the pool in a reasonable amount
> >>> of time. Agree?
> >>
> >> Whether we need to guarantee the release is the real question.
> >
> > yes, this is the main goal of my email. The defer-free skbs behaviour seems
> > in contrast with the page_pool pending-pages monitor mechanism, or at least
> > they do not work well together.
> >
> > @Jesper, Ilias: any input on it?
> >
> >> Maybe it's more of a false-positive warning.
> >>
> >> Flushing the defer list is probably fine as a hack, but it's not
> >> a full fix as Eric explained. False positives can still happen.
> >
> > agree, it was just a way to give an idea of the issue, not a proper solution.
> >
> > Regards,
> > Lorenzo
> >
> >>
> >> I'm ambivalent. My only real request would be to make the flushing
> >> a helper in net/core/dev.c rather than open coded in page_pool.c.
>
> I agree.
> We need a central defer_list flushing helper.
>
> It is too easy to say this is a false-positive warning.
> IMHO this exposes an issue with the sd->defer_list system.
>
> Lorenzo's test is adding+removing veth devices, which creates and runs
> NAPI processing on random CPUs. After the veth netdevices (+NAPI) are
> removed, nothing will naturally invoke net_rx_softirq on this CPU.
> Thus, we have SKBs waiting on a CPU's sd->defer_list. Furthermore, we will
> not create new SKBs with this skb->alloc_cpu, to trigger the RX softirq IPI
> call (trigger_rx_softirq), even if this CPU processes and frees SKBs.
>
> I see two solutions:
>
> (1) When netdevice/NAPI unregister happens, call the defer_list flushing
> helper.
>
> (2) Use the napi_watchdog to detect if the defer_list is (many jiffies) old,
> and then call the defer_list flushing helper.
>
>
> >>
> >> Somewhat related - Eric, do we need to handle defer_list in dev_cpu_dead()?
>
> Looks to me like dev_cpu_dead() also needs this flushing helper for
> sd->defer_list, or at least moving the sd->defer_list to an sd that will
> run eventually.

I think I just considered that having a few skbs in a per-cpu list would not
be an issue, especially considering skbs can sit for hours in tcp receive
queues.

Do we expect to have some kind of callback/shrinker to instruct TCP or pipes
to release all pages that prevent a page_pool from being freed?

Here, we are talking about hundreds of thousands of skbs, compared to at most
32 skbs per cpu.

Perhaps set sysctl_skb_defer_max to zero by default, so that admins can opt in.