Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool


 



On 3/8/2025 10:54 PM, Toke Høiland-Jørgensen wrote:
> When enabling DMA mapping in page_pool, pages are kept DMA mapped until
> they are released from the pool, to avoid the overhead of re-mapping the
> pages every time they are used. This causes problems when a device is
> torn down, because the page pool can't unmap the pages until they are
> returned to the pool. This causes resource leaks and/or crashes when
> there are pages still outstanding while the device is torn down, because
> page_pool will attempt an unmap of a non-existent DMA device on the
> subsequent page return.

> To fix this, implement a simple tracking of outstanding dma-mapped pages
> in page pool using an xarray. This was first suggested by Mina[0], and
> turns out to be fairly straightforward: We simply store pointers to
> pages directly in the xarray with xa_alloc() when they are first DMA
> mapped, and remove them from the array on unmap. Then, when a page pool
> is torn down, it can simply walk the xarray and unmap all pages still
> present there before returning, which also allows us to get rid of the
> get/put_device() calls in page_pool. Using xa_cmpxchg(), no additional
> synchronisation is needed, as a page will only ever be unmapped once.

The implementation of xa_cmpxchg() takes the xa_lock, which is a
per-xarray spinlock.
Yes, if we take that per-xarray lock unconditionally, additional
synchronisation like RCU doesn't seem to be needed. But isn't that an
excessive overhead for normal packet processing, when
page_pool_destroy() has not even been called yet?

Also, we might need similar locking or synchronisation for the DMA sync
API, so that the DMA sync calls can also be skipped once
page_pool_destroy() has been called.
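
For reference, here is a minimal sketch of how I understand the proposed
scheme (the helper names and the toy stand-in for struct page are mine,
not the actual patch): pages are inserted with xa_alloc() when they are
DMA mapped, removed with xa_cmpxchg() on the first unmap, and whatever
is left is walked and unmapped at destroy time. Both xa_alloc() and
xa_cmpxchg() take the per-xarray xa_lock internally, which is where the
overhead concern above comes from:

    #include <linux/xarray.h>

    /* Toy stand-in for the parts of struct page the scheme needs;
     * the RFC stashes the ID in the old '_pp_mapping_pad' field. */
    struct pp_page {
            u32 dma_index;
    };

    /* XA_FLAGS_ALLOC1: IDs start at 1, so 0 can mean "not tracked". */
    static DEFINE_XARRAY_ALLOC1(pp_dma_mapped);

    /* Called when a page is first DMA mapped. */
    static int pp_track(struct pp_page *page)
    {
            u32 id;
            /* xa_alloc() takes the per-xarray xa_lock internally. */
            int err = xa_alloc(&pp_dma_mapped, &id, page, xa_limit_32b,
                               GFP_KERNEL);

            if (err)
                    return err;
            page->dma_index = id;
            return 0;
    }

    /* Called both on page return and on pool destroy; only the caller
     * that wins the cmpxchg performs the (single) DMA unmap. */
    static void pp_untrack(struct pp_page *page)
    {
            if (xa_cmpxchg(&pp_dma_mapped, page->dma_index, page, NULL,
                           GFP_KERNEL) == page) {
                    /* dma_unmap_page_attrs(...) would go here */
            }
    }

    /* Pool teardown: unmap anything still outstanding. */
    static void pp_destroy(void)
    {
            unsigned long index;
            struct pp_page *page;

            xa_for_each(&pp_dma_mapped, index, page)
                    pp_untrack(page);
            xa_destroy(&pp_dma_mapped);
    }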


> To avoid having to walk the entire xarray on unmap to find the page
> reference, we stash the ID assigned by xa_alloc() into the page
> structure itself, in the field previously called '_pp_mapping_pad' in
> the page_pool struct inside struct page. This field overlaps with the
> page->mapping pointer, which may turn out to be problematic, so an
> alternative is probably needed. Sticking the ID into some of the upper
> bits of page->pp_magic may work as an alternative, but that requires
> further investigation. Using the 'mapping' field works well enough as
> a demonstration for this RFC, though.
When trying to reuse the unused bits in page->pp_magic, there seem to
be at most 16 bits available, as BPF_PTR_POISON and STACK_DEPOT_POISON
already appear to use 16 bits, so:
1. On 32-bit systems, there seem to be only 16 bits left, even if
   ILLEGAL_POINTER_VALUE is defined as 0x0.
2. On 64-bit systems, there seem to be only 12 bits left on powerpc,
   as ILLEGAL_POINTER_VALUE is defined as 0x5deadbeef0000000; see
   arch/powerpc/Kconfig.

So it seems we might need to limit the DMA mapping count to 4096 or
65536?

And to be honest, I am not sure whether those 'unused' 12/16 bits can
really be reused or not; I guess we need input from mm and arch experts
here.
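
To make the bit budget concrete, here is what stashing the ID in
page->pp_magic could look like for the 12-bit powerpc-constrained case
above (the shift value, and whether these bits are actually safe to
take, are assumptions that need exactly the mm/arch confirmation I am
asking for; a 64-bit unsigned long is assumed):

    /* Illustrative only: assumes 12 bits of a 64-bit pp_magic are free
     * and that this position collides with neither the poison pattern
     * nor the PP_SIGNATURE check. 12 bits of ID allows at most 4095
     * concurrently mapped pages (ID 0 reserved), i.e. the ~4096 limit
     * mentioned above. */
    #define PP_DMA_INDEX_BITS   12
    #define PP_DMA_INDEX_SHIFT  32      /* assumed, arch-dependent */
    #define PP_DMA_INDEX_MASK \
            (((1UL << PP_DMA_INDEX_BITS) - 1) << PP_DMA_INDEX_SHIFT)

    static void pp_magic_set_index(unsigned long *pp_magic,
                                   unsigned long id)
    {
            *pp_magic = (*pp_magic & ~PP_DMA_INDEX_MASK) |
                        ((id << PP_DMA_INDEX_SHIFT) & PP_DMA_INDEX_MASK);
    }

    static unsigned long pp_magic_get_index(unsigned long pp_magic)
    {
            return (pp_magic & PP_DMA_INDEX_MASK) >> PP_DMA_INDEX_SHIFT;
    }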


> Since all the tracking is performed on DMA map/unmap, no additional code
> is needed in the fast path, meaning the performance overhead of this
> tracking is negligible. The extra memory needed to track the pages is
> neatly encapsulated inside xarray, which uses the 'struct xa_node'
> structure to track items. This structure is 576 bytes long, with slots
> for 64 items, meaning that a full node incurs only 9 bytes of overhead
> per slot it tracks (in practice, it probably won't be this efficient,
> but in any case it should be an acceptable overhead).

Even if items are stored sequentially in an xa_node at first, is it
possible that those xa_nodes become fragmented as pages are released
and allocated many times during packet processing? If so, is there any
way to get fragmentation information about those xa_nodes?


> [0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@xxxxxxxxxxxxxx/

> Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code")
> Reported-by: Yonglong Liu <liuyonglong@xxxxxxxxxx>
> Suggested-by: Mina Almasry <almasrymina@xxxxxxxxxx>
> Reviewed-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
> Tested-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
> Signed-off-by: Toke Høiland-Jørgensen <toke@xxxxxxxxxx>
> ---
> This is an alternative to Yunsheng's series. Yunsheng requested I send
> this as an RFC to better be able to discuss the different approaches; see
> some initial discussion in[1], also regarding where to store the ID as
> alluded to above.

As mentioned before, I am not really convinced there is any space left
in 'struct page'; otherwise we might already have used that space to
fix the problem of DMA addresses wider than 32 bits on 32-bit systems,
see page_pool_set_dma_addr_netmem().

Also, using more space in 'struct page' for page_pool seems to couple
page_pool more tightly to the mm subsystem. That does not seem to align
with the folios work, which, in the long term, is trying to decouple
non-mm subsystems from the mm subsystem by discouraging them from using
more of 'struct page' for their own metadata.
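
For reference, the problem being referred to is roughly the following
(a simplified sketch of the logic in page_pool_set_dma_addr_netmem();
'dma' here is a toy stand-in for the field in struct page): with a
64-bit dma_addr_t on a 32-bit system, the address is stored shifted
down by PAGE_SHIFT, and the mapping has to be failed when it still
does not fit:

    #include <linux/mm.h>       /* PAGE_SHIFT */
    #include <linux/types.h>    /* dma_addr_t */

    /* Returns true on failure, as the real helper does. */
    static bool pp_set_dma_addr(unsigned long *dma, dma_addr_t addr)
    {
            if (sizeof(dma_addr_t) > sizeof(unsigned long)) {
                    /* Pages are page-aligned, so the low PAGE_SHIFT
                     * bits of the address are zero and can be dropped
                     * to gain some headroom... */
                    *dma = addr >> PAGE_SHIFT;
                    /* ...but if the address still doesn't round-trip,
                     * there is nowhere to put the missing upper bits. */
                    return ((dma_addr_t)*dma << PAGE_SHIFT) != addr;
            }
            *dma = addr;
            return false;
    }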


> -Toke

> [1] https://lore.kernel.org/r/40b33879-509a-4c4a-873b-b5d3573b6e14@xxxxxxxxx
