Yunsheng Lin <linyunsheng@xxxxxxxxxx> writes:

> On 2024/10/23 2:14, Jesper Dangaard Brouer wrote:
>>
>>
>> On 22/10/2024 05.22, Yunsheng Lin wrote:
>>> A networking driver with page_pool support may hand over a page
>>> that still has a DMA mapping to the network stack and try to reuse
>>> that page after the network stack is done with it and passes it
>>> back to the page_pool, to avoid the penalty of DMA
>>> mapping/unmapping. With all the caching in the network stack, some
>>> pages may be held in the network stack without being returned to
>>> the page_pool soon enough, and with a VF disable causing the driver
>>> to be unbound, the page_pool does not stop the driver from doing
>>> its unbinding work; instead the page_pool uses a workqueue to
>>> periodically check whether some pages have come back from the
>>> network stack, and if so it does the DMA unmapping related cleanup
>>> work.
>>>
>>> As mentioned in [1], attempting DMA unmaps after the driver
>>> has already unbound may leak resources or at worst corrupt
>>> memory. Fundamentally, the page pool code cannot allow DMA
>>> mappings to outlive the driver they belong to.
>>>
>>> Currently there seem to be at least two cases where a page is not
>>> released fast enough, causing the DMA unmapping to be done after
>>> the driver has already been unbound:
>>> 1. IPv4 packet defragmentation timeout: this seems to cause delays
>>> of up to 30 secs.
>>> 2. skb_defer_free_flush(): this may cause an infinite delay if
>>> there is no triggering of net_rx_action().
>>>
>>> In order not to do the DMA unmapping after the driver has already
>>> been unbound, and not to stall the unloading of the networking
>>> driver, add the pool->items array to record all the pages,
>>> including the ones handed over to the network stack, so the
>>> page_pool can
>>
>> I really, really dislike this approach!
>>
>> Nacked-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
>>
>> Having to keep an array recording all the pages, including the ones
>> handed over to the network stack, goes against the very principle
>> behind page_pool. We added members to struct page such that pages
>> could be "outstanding".
>
> Before and after this patch, "outstanding" pages are supported; the
> difference is how many "outstanding" pages are supported.
>
> The question seems to be: do we really need an unlimited number of
> inflight pages for the page_pool to work, as mentioned in [1]?
>
> 1. https://lore.kernel.org/all/5d9ea7bd-67bb-4a9d-a120-c8f290c31a47@xxxxxxxxxx/

Well, yes? Imposing an arbitrary limit on the number of in-flight
packets (especially such a low one as in this series) is a complete
non-starter. Servers have hundreds of gigs of memory these days, and if
someone wants to use that for storing in-flight packets, the kernel
definitely shouldn't impose some (hard-coded!) limit on that.

>>
>> The page_pool already has a system for waiting for these outstanding
>> / inflight packets to get returned. As I suggested before, the
>> page_pool should simply take over the responsibility (from the
>> net_device) to free the struct device (after inflight reaches zero),
>> which AFAIK is where the DMA device is connected.
>
> It seems you mentioned a similar suggestion in a previous version; it
> would be good to show some code for the idea you have in mind. I am
> sure that Yonglong Liu (Cc'ed) will be happy to test it if some code
> like a POC/RFC is provided.

I believe Jesper is basically referring to Jakub's RFC that you
mentioned below.
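For illustration, here is a rough sketch of the kind of deferred
teardown Jesper is describing: the pool pins the DMA device with
get_device() at creation time and only drops that reference from a
delayed work item once the inflight count has reached zero. This is
not the in-tree page_pool code; all the names here (my_pool,
my_pool_release_retry(), etc.) are made up for the sketch.

#include <linux/atomic.h>
#include <linux/device.h>
#include <linux/jiffies.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

/* Sketch only -- illustrative names, not the actual page_pool code. */
struct my_pool {
	struct device *dma_dev;	/* pinned with get_device() at create time */
	atomic_t inflight;	/* pages handed to the stack, not yet returned */
	struct delayed_work release_dw;
};

static void my_pool_release_retry(struct work_struct *wq)
{
	struct delayed_work *dwq = to_delayed_work(wq);
	struct my_pool *pool = container_of(dwq, struct my_pool, release_dw);

	if (atomic_read(&pool->inflight)) {
		/* Pages are still out in the stack: keep the device (and
		 * its DMA mappings) alive and check again later.
		 */
		schedule_delayed_work(&pool->release_dw, HZ);
		return;
	}

	/* All pages have been returned and unmapped: now it is safe to
	 * drop the device reference and free the pool.
	 */
	put_device(pool->dma_dev);
	kfree(pool);
}

static void my_pool_destroy(struct my_pool *pool)
{
	/* Driver is going away; defer the final teardown until the
	 * inflight count hits zero instead of blocking unload.
	 */
	INIT_DELAYED_WORK(&pool->release_dw, my_pool_release_retry);
	schedule_delayed_work(&pool->release_dw, HZ);
}

Whether something along those lines is actually sufficient hinges on
your point below: a device refcount keeps the struct device itself
alive, but it does not necessarily make it valid to keep calling the
DMA API against it after remove() has returned.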
> I should mention that it seems the DMA device is no longer valid once
> the remove() function of the device driver returns, as mentioned in
> [2], which means the DMA API is not allowed to be called after the
> driver's remove() function returns.
>
> 2. https://www.spinics.net/lists/netdev/msg1030641.html
>
>>
>> The alternative is what Kuba suggested (and proposed an RFC for),
>> that the net_device teardown waits for the page_pool inflight
>> packets.
>
> As above, the question is how long the waiting takes here. Yonglong
> tested Kuba's RFC, see [3]; the waiting took forever due to the
> reason mentioned in the commit log:
> "skb_defer_free_flush(): this may cause an infinite delay if there is
> no triggering of net_rx_action()."

Honestly, this just seems like a bug (the "no triggering of
net_rx_action()") that should be root caused and fixed; not a reason
that waiting can't work.

-Toke