Re: issue with inflight pages from page_pool

On 18/04/2023 09.36, Lorenzo Bianconi wrote:
On Mon, 17 Apr 2023 23:31:01 +0200 Lorenzo Bianconi wrote:
If it's that then I'm with Eric. There are many ways to keep the pages
in use, no point working around one of them and not the rest :(

I was not clear here, my fault. What I mean is I can see the returned
pages counter increasing from time to time, but in most of the tests,
even 2h after the TCP traffic has stopped, page_pool_release_retry()
still complains that not all the pages have been returned to the pool,
and so the pool has not been deallocated yet.
The chunk of code in my first email is just to demonstrate the issue
and I am completely fine with a better solution :)

Your problem is perhaps made worse by threaded NAPI, you have
defer-free skbs sprayed across all cores and no NAPI there to
flush them :(

yes, exactly :)


I guess we just need a way to free the pool in a reasonable amount
of time. Agree?

Whether we need to guarantee the release is the real question.

yes, this is the main goal of my email. The deferred-free skb behaviour seems
to conflict with the page_pool pending-pages monitor mechanism, or at least
the two do not work well together.

@Jesper, Ilias: any input on it?

Maybe it's more of a false-positive warning.

Flushing the defer list is probably fine as a hack, but it's not
a full fix, as Eric explained. False positives can still happen.

agree, it was just a way to give an idea of the issue, not a proper solution.

Regards,
Lorenzo


I'm ambivalent. My only real request would be to make the flushing
a helper in net/core/dev.c rather than open coded in page_pool.c.

I agree. We need a central defer_list flushing helper.
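Something along these lines (completely untested sketch: the per-CPU
drain mirrors the existing static skb_defer_free_flush() in
net/core/dev.c, but uses kfree_skb() so it can also run from process
context; the _all() wrapper and both function names are made up):

	/* Drain one CPU's deferred-free SKB list. Modeled on the static
	 * skb_defer_free_flush() in net/core/dev.c, except SKBs are freed
	 * with kfree_skb() so this can run from process context too. */
	static void defer_list_flush(struct softnet_data *sd)
	{
		struct sk_buff *skb, *next;

		/* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
		if (!READ_ONCE(sd->defer_list))
			return;

		spin_lock_irq(&sd->defer_lock);
		skb = sd->defer_list;
		sd->defer_list = NULL;
		sd->defer_count = 0;
		spin_unlock_irq(&sd->defer_lock);

		while (skb) {
			next = skb->next;
			skb_mark_not_on_list(skb);
			kfree_skb(skb);
			skb = next;
		}
	}

	/* Hypothetical central helper: flush every CPU's defer_list.
	 * This throws away the cache-locality the deferral was buying,
	 * which is fine on a teardown/slow path. */
	void skb_defer_list_flush_all(void)
	{
		int cpu;

		for_each_possible_cpu(cpu)
			defer_list_flush(&per_cpu(softnet_data, cpu));
	}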

It is too easy to say this is a false-positive warning.
IMHO this exposes an issue with the sd->defer_list system.

Lorenzo's test is adding+removing veth devices, which creates and runs
NAPI processing on random CPUs.  After the veth netdevices (+NAPI) are
removed, nothing will naturally invoke net_rx_softirq on those CPUs.
Thus, we have SKBs waiting on each CPU's sd->defer_list.  Furthermore,
no new SKBs will be created with this skb->alloc_cpu, so the RX softirq
IPI call (trigger_rx_softirq) is never sent, even if this CPU processes
and frees SKBs.
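For context, this is roughly what skb_attempt_defer_free() in
net/core/skbuff.c does (heavily abridged, details elided):

	/* Heavily abridged, to show why the SKBs get stranded: the IPI
	 * that drains a remote CPU's defer_list is only sent from here,
	 * i.e. only when yet another SKB allocated on that CPU gets
	 * freed remotely. */
	void skb_attempt_defer_free(struct sk_buff *skb)
	{
		int cpu = skb->alloc_cpu;
		struct softnet_data *sd;
		unsigned int defer_max;
		bool kick;

		if (cpu == raw_smp_processor_id() || !cpu_online(cpu)) {
			__kfree_skb(skb);	/* free immediately */
			return;
		}

		sd = &per_cpu(softnet_data, cpu);
		defer_max = READ_ONCE(sysctl_skb_defer_max);
		/* ... defer_count >= defer_max also frees immediately ... */

		spin_lock_bh(&sd->defer_lock);
		/* Send an IPI every time the queue reaches half capacity */
		kick = sd->defer_count == (defer_max >> 1);
		WRITE_ONCE(sd->defer_count, sd->defer_count + 1);
		skb->next = sd->defer_list;
		WRITE_ONCE(sd->defer_list, skb);
		spin_unlock_bh(&sd->defer_lock);

		/* trigger_rx_softirq() on the remote CPU via defer_csd */
		if (unlikely(kick) && !cmpxchg(&sd->defer_ipi_scheduled, 0, 1))
			smp_call_function_single_async(cpu, &sd->defer_csd);
	}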

I see two solutions:

(1) When a netdevice/NAPI unregister happens, call the defer_list flushing helper (see the sketch after this list).

(2) Use the napi_watchdog to detect that the defer_list is (many jiffies) old, and then call the defer_list flushing helper.
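For (1), it could be as simple as a netdevice notifier (untested sketch,
assuming the flush-all helper from above; the real fix may instead want
to live directly in the unregister path in net/core/dev.c):

	static int defer_list_netdev_event(struct notifier_block *nb,
					   unsigned long event, void *ptr)
	{
		/* Drain all per-CPU defer_lists once a device goes away,
		 * so no SKB (and thus no page_pool page) stays stranded
		 * on a CPU whose NAPI is gone. */
		if (event == NETDEV_UNREGISTER)
			skb_defer_list_flush_all();	/* sketch from above */
		return NOTIFY_DONE;
	}

	static struct notifier_block defer_list_nb = {
		.notifier_call	= defer_list_netdev_event,
	};

	/* somewhere in init: register_netdevice_notifier(&defer_list_nb); */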



Somewhat related - Eric, do we need to handle defer_list in dev_cpu_dead()?

Looks to me like dev_cpu_dead() also needs this flushing helper for
sd->defer_list, or at least to move the sd->defer_list over to an sd
that will eventually run.
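Something like this (sketch only, reusing the drain helper from above;
dev_cpu_dead() is the existing CPU-hotplug callback in net/core/dev.c):

	static int dev_cpu_dead(unsigned int oldcpu)
	{
		struct softnet_data *oldsd = &per_cpu(softnet_data, oldcpu);

		/* ... existing backlog/queue migration code ... */

		/* The old CPU is already offline here, so nothing can race
		 * with us on its defer_list; alternatively the list could
		 * be spliced onto the local CPU's sd->defer_list. */
		defer_list_flush(oldsd);	/* assumed helper from above */

		return 0;
	}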

--Jesper



