On 2024/10/24 22:40, Toke Høiland-Jørgensen wrote: ... >>> >>> I really really dislike this approach! >>> >>> Nacked-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx> >>> >>> Having to keep an array to record all the pages including the ones >>> which are handed over to network stack, goes against the very principle >>> behind page_pool. We added members to struct page, such that pages could >>> be "outstanding". >> >> Before and after this patch both support "outstanding", the difference is >> how many "outstanding" pages do they support. >> >> The question seems to be do we really need unlimited inflight page for >> page_pool to work as mentioned in [1]? >> >> 1. https://lore.kernel.org/all/5d9ea7bd-67bb-4a9d-a120-c8f290c31a47@xxxxxxxxxx/ > > Well, yes? Imposing an arbitrary limit on the number of in-flight > packets (especially such a low one as in this series) is a complete > non-starter. Servers have hundreds of gigs of memory these days, and if > someone wants to use that for storing in-flight packets, the kernel > definitely shouldn't impose some (hard-coded!) limit on that. You and Jesper seems to be mentioning a possible fact that there might be 'hundreds of gigs of memory' needed for inflight pages, it would be nice to provide more info or reasoning above why 'hundreds of gigs of memory' is needed here so that we don't do a over-designed thing to support recording unlimited in-flight pages if the driver unbound stalling turns out impossible and the inflight pages do need to be recorded. I guess it is common sense to start with easy one until someone complains with some testcase and detailed reasoning if we need to go the hard way as you and Jesper are also prefering waiting over having to record the inflight pages. More detailed about why we might need to go the hard way of having to record the inflight pages as below. > >>> >>> The page_pool already have a system for waiting for these outstanding / >>> inflight packets to get returned. As I suggested before, the page_pool >>> should simply take over the responsability (from net_device) to free the >>> struct device (after inflight reach zero), where AFAIK the DMA device is >>> connected via. >> >> It seems you mentioned some similar suggestion in previous version, >> it would be good to show some code about the idea in your mind, I am sure >> that Yonglong Liu Cc'ed will be happy to test it if there some code like >> POC/RFC is provided. > > I believe Jesper is basically referring to Jakub's RFC that you > mentioned below. > >> I should mention that it seems that DMA device is not longer vaild when >> remove() function of the device driver returns, as mentioned in [2], which >> means dma API is not allowed to called after remove() function of the device >> driver returns. >> >> 2. https://www.spinics.net/lists/netdev/msg1030641.html >> >>> >>> The alternative is what Kuba suggested (and proposed an RFC for), that >>> the net_device teardown waits for the page_pool inflight packets. >> >> As above, the question is how long does the waiting take here? >> Yonglong tested Kuba's RFC, see [3], the waiting took forever due to >> reason as mentioned in commit log: >> "skb_defer_free_flush(): this may cause infinite delay if there is no >> triggering for net_rx_action()." > > Honestly, this just seems like a bug (the "no triggering of > net_rx_action()") that should be root caused and fixed; not a reason > that waiting can't work. I would prefer the waiting too if simple waiting fixed the test cases that Youglong and Haiqing were reporting and I did not look into the rabbit hole of possible caching in networking. As mentioned in commit log and [1]: 1. ipv4 packet defragmentation timeout: this seems to cause delay up to 30 secs, which was reported by Haiqing. 2. skb_defer_free_flush(): this may cause infinite delay if there is no triggering for net_rx_action(), which was reported by Yonglong. For case 1, is it really ok to stall the driver unbound up to 30 secs for the default setting of defragmentation timeout? For case 2, it is possible to add timeout for those kind of caching like the defragmentation timeout too, but as mentioned in [2], it seems to be a normal thing for this kind of caching in networking: "Eric pointed out/predicted there's no guarantee that applications will read / close their sockets so a page pool page may be stuck in a socket (but not leaked) forever." 1. https://lore.kernel.org/all/2c5ccfff-6ab4-4aea-bff6-3679ff72cc9a@xxxxxxxxxx/ 2. https://www.spinics.net/lists/netdev/msg952604.html > > -Toke > >