Yunsheng Lin <linyunsheng@xxxxxxxxxx> writes:

> On 2024/10/23 2:14, Jesper Dangaard Brouer wrote:
>>
>>
>> On 22/10/2024 05.22, Yunsheng Lin wrote:
>>> A networking driver with page_pool support may hand over a page
>>> that still has a DMA mapping to the network stack and try to reuse
>>> that page after the network stack is done with it and passes it
>>> back to the page_pool, to avoid the penalty of DMA
>>> mapping/unmapping. With all the caching in the network stack, some
>>> pages may be held in the network stack without being returned to
>>> the page_pool soon enough, and with a VF disable causing the driver
>>> to be unbound, the page_pool does not stop the driver from doing
>>> its unbinding work; instead the page_pool uses a workqueue to
>>> periodically check whether some pages have come back from the
>>> network stack, and if so it does the DMA unmapping related cleanup
>>> work.
>>>
>>> As mentioned in [1], attempting DMA unmaps after the driver
>>> has already unbound may leak resources or at worst corrupt
>>> memory. Fundamentally, the page pool code cannot allow DMA
>>> mappings to outlive the driver they belong to.
>>>
>>> Currently there seem to be at least two cases where a page is not
>>> released fast enough, causing the DMA unmapping to be done after
>>> the driver has already been unbound:
>>> 1. IPv4 packet defragmentation timeout: this seems to cause delays
>>> of up to 30 secs.
>>> 2. skb_defer_free_flush(): this may cause an infinite delay if
>>> there is no triggering of net_rx_action().
>>>
>>> In order not to do the DMA unmapping after the driver has already
>>> been unbound, and not to stall the unloading of the networking
>>> driver, add the pool->items array to record all the pages,
>>> including the ones handed over to the network stack, so the
>>> page_pool can
>>
>> I really, really dislike this approach!
>>
>> Nacked-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
>>
>> Having to keep an array recording all the pages, including the ones
>> handed over to the network stack, goes against the very principle
>> behind page_pool. We added members to struct page such that pages
>> could be "outstanding".
>
> Before and after this patch, "outstanding" pages are supported; the
> difference is how many "outstanding" pages are supported.
>
> The question seems to be: do we really need an unlimited number of
> inflight pages for the page_pool to work, as mentioned in [1]?
>
> 1. https://lore.kernel.org/all/5d9ea7bd-67bb-4a9d-a120-c8f290c31a47@xxxxxxxxxx/

Well, yes? Imposing an arbitrary limit on the number of in-flight
packets (especially such a low one as in this series) is a complete
non-starter. Servers have hundreds of gigs of memory these days, and if
someone wants to use that for storing in-flight packets, the kernel
definitely shouldn't impose some (hard-coded!) limit on that.

>>
>> The page_pool already has a system for waiting for these outstanding
>> / inflight packets to get returned. As I suggested before, the
>> page_pool should simply take over the responsibility (from the
>> net_device) to free the struct device (after inflight reaches zero),
>> which AFAIK is where the DMA device is connected.
>
> It seems you mentioned a similar suggestion in a previous version; it
> would be good to show some code for the idea you have in mind. I am
> sure that Yonglong Liu (Cc'ed) will be happy to test it if some code
> like a POC/RFC is provided.

I believe Jesper is basically referring to Jakub's RFC that you
mentioned below.
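For illustration, here is a rough sketch of the kind of deferred
teardown Jesper is describing: the pool pins the DMA device with
get_device() at creation time and only drops that reference from a
delayed work item once the inflight count has reached zero. This is
not the in-tree page_pool code; all the names here (my_pool,
my_pool_release_retry(), etc.) are made up for the sketch.

#include <linux/atomic.h>
#include <linux/device.h>
#include <linux/jiffies.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

/* Sketch only -- illustrative names, not the actual page_pool code. */
struct my_pool {
	struct device *dma_dev;	/* pinned with get_device() at create time */
	atomic_t inflight;	/* pages handed to the stack, not yet returned */
	struct delayed_work release_dw;
};

static void my_pool_release_retry(struct work_struct *wq)
{
	struct delayed_work *dwq = to_delayed_work(wq);
	struct my_pool *pool = container_of(dwq, struct my_pool, release_dw);

	if (atomic_read(&pool->inflight)) {
		/* Pages are still out in the stack: keep the device (and
		 * its DMA mappings) alive and check again later.
		 */
		schedule_delayed_work(&pool->release_dw, HZ);
		return;
	}

	/* All pages have been returned and unmapped: now it is safe to
	 * drop the device reference and free the pool.
	 */
	put_device(pool->dma_dev);
	kfree(pool);
}

static void my_pool_destroy(struct my_pool *pool)
{
	/* Driver is going away; defer the final teardown until the
	 * inflight count hits zero instead of blocking unload.
	 */
	INIT_DELAYED_WORK(&pool->release_dw, my_pool_release_retry);
	schedule_delayed_work(&pool->release_dw, HZ);
}

Whether something along those lines is actually sufficient hinges on
your point below: a device refcount keeps the struct device itself
alive, but it does not necessarily make it valid to keep calling the
DMA API against it after remove() has returned.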
> I should mention that it seems the DMA device is no longer valid once
> the remove() function of the device driver returns, as mentioned in
> [2], which means the DMA API is not allowed to be called after the
> driver's remove() function returns.
>
> 2. https://www.spinics.net/lists/netdev/msg1030641.html
>
>>
>> The alternative is what Kuba suggested (and proposed an RFC for),
>> that the net_device teardown waits for the page_pool inflight
>> packets.
>
> As above, the question is how long the waiting takes here. Yonglong
> tested Kuba's RFC, see [3]; the waiting took forever due to the
> reason mentioned in the commit log:
> "skb_defer_free_flush(): this may cause an infinite delay if there is
> no triggering of net_rx_action()."

Honestly, this just seems like a bug (the "no triggering of
net_rx_action()") that should be root caused and fixed; not a reason
that waiting can't work.

-Toke