Re: issue with inflight pages from page_pool

On 19/04/2023 14.09, Eric Dumazet wrote:
On Wed, Apr 19, 2023 at 1:08 PM Jesper Dangaard Brouer wrote:


On 18/04/2023 09.36, Lorenzo Bianconi wrote:
On Mon, 17 Apr 2023 23:31:01 +0200 Lorenzo Bianconi wrote:
If it's that then I'm with Eric. There are many ways to keep the pages
in use, no point working around one of them and not the rest :(

I was not clear here, my fault. What I meant is that I can see the
returned-pages counter increasing from time to time, but in most tests,
even 2h after the TCP traffic has stopped, page_pool_release_retry()
still complains that not all pages have been returned to the pool, and
so the pool has not been deallocated yet.
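For reference, this is roughly the shutdown-retry logic that prints these
complaints (paraphrased from memory, net/core/page_pool.c, kernels around
6.3); it reschedules itself for as long as pages are inflight:

static void page_pool_release_retry(struct work_struct *wq)
{
        struct delayed_work *dwq = to_delayed_work(wq);
        struct page_pool *pool = container_of(dwq, struct page_pool,
                                              release_dw);
        int inflight;

        inflight = page_pool_release(pool);     /* pages not yet returned */
        if (!inflight)
                return;                         /* all back, pool is freed */

        /* Periodic warning while shutdown is stalled */
        if (time_after_eq(jiffies, pool->defer_warn)) {
                int sec = (s32)((u32)jiffies - (u32)pool->defer_start) / HZ;

                pr_warn("%s() stalled pool shutdown %d inflight %d sec\n",
                        __func__, inflight, sec);
                pool->defer_warn = jiffies + DEFER_WARN_INTERVAL;
        }

        schedule_delayed_work(&pool->release_dw, DEFER_TIME);
}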
The chunk of code in my first email is just to demonstrate the issue,
and I am completely fine with a better solution :)

Your problem is perhaps made worse by threaded NAPI: you have
defer-free skbs sprayed across all cores and no NAPI there to
flush them :(

yes, exactly :)


I guess we just need a way to free the pool in a reasonable amount
of time. Agree?

Whether we need to guarantee the release is the real question.

yes, this is the main goal of my email. The defer-free skb behaviour seems
to conflict with the page_pool pending-pages monitoring mechanism, or at
least the two do not work well together.

@Jesper, Ilias: any input on it?

Maybe it's more of a false-positive warning.

Flushing the defer list is probably fine as a hack, but it's not
a full fix, as Eric explained. False positives can still happen.

agree, it was just a way to give an idea of the issue, not a proper solution.

Regards,
Lorenzo


I'm ambivalent. My only real request would be to make the flushing
a helper in net/core/dev.c rather than open-coded in page_pool.c.

I agree. We need a central defer_list flushing helper

It is too easy to say this is a false-positive warning.
IMHO this exposes an issue with the sd->defer_list system.

Lorenzo's test is adding+removing veth devices, which creates and runs
NAPI processing on random CPUs.  After the veth netdevices (+NAPI) are
removed, nothing will naturally invoke net_rx_softirq on those CPUs.
Thus, we have SKBs waiting on each CPU's sd->defer_list.  Furthermore,
no new SKBs will be created with this skb->alloc_cpu, so nothing
triggers the RX softirq IPI call (trigger_rx_softirq), even if this CPU
processes and frees SKBs.
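For clarity, here is a simplified sketch of the deferral path (paraphrased
from memory, skb_attempt_defer_free() in net/core/skbuff.c, kernels around
6.3; not a verbatim copy). Note the IPI is only kicked when the remote
queue keeps growing, so an idle CPU's defer_list is never flushed:

void skb_attempt_defer_free(struct sk_buff *skb)
{
        int cpu = skb->alloc_cpu;
        struct softnet_data *sd;
        unsigned long flags;
        bool kick;

        if (cpu >= nr_cpu_ids || !cpu_online(cpu) ||
            cpu == raw_smp_processor_id()) {
                __kfree_skb(skb);       /* free locally, no deferral */
                return;
        }

        sd = &per_cpu(softnet_data, cpu);
        if (READ_ONCE(sd->defer_count) >= READ_ONCE(sysctl_skb_defer_max)) {
                __kfree_skb(skb);       /* queue full, free locally */
                return;
        }

        spin_lock_irqsave(&sd->defer_lock, flags);
        skb->next = sd->defer_list;
        sd->defer_list = skb;
        sd->defer_count++;
        /* Kick the remote CPU only as the queue reaches half capacity. */
        kick = sd->defer_count == (READ_ONCE(sysctl_skb_defer_max) >> 1);
        spin_unlock_irqrestore(&sd->defer_lock, flags);

        if (kick && !cmpxchg(&sd->defer_ipi_scheduled, 0, 1))
                smp_call_function_single_async(cpu, &sd->defer_csd);
}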

I see two solutions:

   (1) When netdevice/NAPI unregister happens, call the defer_list
flushing helper.

   (2) Use napi_watchdog to detect that a defer_list is (many jiffies)
old, and then call the defer_list flushing helper.
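To make that concrete, a rough sketch of such a central helper. The name
skb_defer_free_flush_all() is made up here (not merged code); the per-CPU
body reuses the same logic as the existing static skb_defer_free_flush()
in net/core/dev.c:

void skb_defer_free_flush_all(void)
{
        int cpu;

        for_each_possible_cpu(cpu) {
                struct softnet_data *sd = &per_cpu(softnet_data, cpu);
                struct sk_buff *skb, *next;
                unsigned long flags;

                if (!READ_ONCE(sd->defer_list))
                        continue;

                spin_lock_irqsave(&sd->defer_lock, flags);
                skb = sd->defer_list;
                sd->defer_list = NULL;
                sd->defer_count = 0;
                spin_unlock_irqrestore(&sd->defer_lock, flags);

                while (skb) {
                        next = skb->next;
                        /* budget 0 => dev_consume_skb_any() path, so
                         * this is callable from process context too. */
                        napi_consume_skb(skb, 0);
                        skb = next;
                }
        }
}

That should make it safe to call from both a netdevice-unregister path
(solution 1) and a napi_watchdog check (solution 2).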



Somewhat related - Eric, do we need to handle defer_list in dev_cpu_dead()?

Looks to me like dev_cpu_dead() also needs this flushing helper for
sd->defer_list, or at least to move the sd->defer_list to an sd that
will eventually run.
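E.g. something along these lines (a sketch only, helper name hypothetical):
splice the dead CPU's list onto the local one, which net_rx_action() will
flush eventually. The old CPU is offline, so nothing races with us on its
softnet_data:

static void defer_list_takeover(struct softnet_data *oldsd)
{
        struct softnet_data *sd = this_cpu_ptr(&softnet_data);
        struct sk_buff *skb = oldsd->defer_list;
        struct sk_buff *tail;
        unsigned long flags;

        if (!skb)
                return;

        oldsd->defer_list = NULL;

        tail = skb;
        while (tail->next)              /* find tail of the dead CPU's list */
                tail = tail->next;

        spin_lock_irqsave(&sd->defer_lock, flags);
        tail->next = sd->defer_list;    /* chain our list behind it */
        sd->defer_list = skb;
        sd->defer_count += oldsd->defer_count;
        spin_unlock_irqrestore(&sd->defer_lock, flags);

        oldsd->defer_count = 0;
        /* net_rx_action() on this CPU flushes sd->defer_list on its
         * next run, releasing these skbs and their page_pool pages. */
}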

I think I just considered that having a few skbs in a per-cpu list
would not be an issue, especially considering skbs can sit for hours
in TCP receive queues.


It was the first thing I said to Lorenzo when he first reported the
problem to me (over chat): it is likely packets sitting in a TCP queue.
Then I instructed him to look at the output from netstat to see the
queues and look for TIME-WAIT, FIN-WAIT, etc.


Do we expect having some kind of callback/shrinker to instruct TCP or
pipes to release all pages that prevent
a page_pool from being freed?


This is *not* what I'm asking for.

With TCP sockets (pipes etc.) we can take care of closing the sockets
(and programs etc.) to free up the SKBs (and perhaps wait for timeouts),
to make sure the page_pool shutdown doesn't hang.

The problem arises for all the selftests that use veth and bpf_test_run
(using bpf_test_run_xdp_live / xdp_test_run_setup).  For the selftests
we obviously take care of closing sockets and removing the veth
interfaces again.  Problem: the defer_list corner case isn't under our
control.


Here, we are talking about hundreds of thousands of skbs, compared to
at most 32 skbs per CPU.


It is not a memory usage concern.

Perhaps set sysctl_skb_defer_max to zero by default, so that admins
can opt in


I really like the sd->defer_list system and I think it should be enabled
by default.  Even if disabled by default, we would still need to handle
these corner cases, as the selftests shouldn't start causing issues when
it gets enabled.

The simple solution is (1): when netdevice/NAPI unregister happens, call
the defer_list flushing helper.  And perhaps we also need to call it in
xdp_test_run_teardown() (see the sketch below).  How do you feel about
that?
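E.g. (sketch; the xdp_test_run_teardown() body is paraphrased from memory,
and skb_defer_free_flush_all() is the hypothetical helper from earlier in
the thread):

/* net/bpf/test_run.c */
static void xdp_test_run_teardown(struct xdp_test_data *xdp)
{
        /* New: drain the per-cpu defer lists, so the page_pool shutdown
         * below doesn't see stale "inflight" pages stuck on some CPU's
         * sd->defer_list. */
        skb_defer_free_flush_all();

        xdp_unreg_mem_model(&xdp->mem);
        page_pool_destroy(xdp->pp);
        kfree(xdp->frames);
        kfree(xdp->skbs);
}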

--Jesper



