On 2025/3/10 17:13, Toke Høiland-Jørgensen wrote:

...

>
>> Also, we might need a similar locking or synchronisation for the dma
>> sync API in order to skip the dma sync API when page_pool_destroy() is
>> called too.
>
> Good point, but that seems a separate issue? And simpler to solve (just

If I understand the comment from the DMA experts correctly, the dma_sync
API must not be called after the dma_unmap API.

> set pool->dma_sync to false when destroying?).

Without locking or synchronisation, there is a small window between
checking pool->dma_sync and calling the dma sync API, during which the
driver might already have been unbound.

>
>>> To avoid having to walk the entire xarray on unmap to find the page
>>> reference, we stash the ID assigned by xa_alloc() into the page
>>> structure itself, in the field previously called '_pp_mapping_pad' in
>>> the page_pool struct inside struct page. This field overlaps with the
>>> page->mapping pointer, which may turn out to be problematic, so an
>>> alternative is probably needed. Sticking the ID into some of the upper
>>> bits of page->pp_magic may work as an alternative, but that requires
>>> further investigation. Using the 'mapping' field works well enough as
>>> a demonstration for this RFC, though.
>>
>> page->pp_magic seems to have only 16 bits available at most when
>> trying to reuse some unused bits in page->pp_magic, as BPF_PTR_POISON
>> and STACK_DEPOT_POISON seem to already use 16 bits, so:
>> 1. For 32-bit systems, there seem to be only 16 bits left even if
>>    ILLEGAL_POINTER_VALUE is defined as 0x0.
>> 2. For 64-bit systems, there seem to be only 12 bits left for powerpc,
>>    as ILLEGAL_POINTER_VALUE is defined as 0x5deadbeef0000000; see
>>    arch/powerpc/Kconfig.
>>
>> So it seems we might need to limit the dma mapping count to 4096 or
>> 65536?
>>
>> And to be honest, I am not sure whether those 'unused' 12/16 bits can
>> really be reused or not; I guess we need suggestions from the mm and
>> arch experts here.
>
> Why do we need to care about BPF_PTR_POISON and STACK_DEPOT_POISON?
> AFAICT, we only need to make sure we preserve the PP_SIGNATURE value.
> See v2 of the RFC patch; the bit arithmetic there gives me:
>
> - 24 bits on 32-bit architectures
> - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
> - 32 bits on other 64-bit architectures
>
> Which seems to be plenty?

I am really doubtful it is that simple, but we can always hear from the
experts if it isn't :)

>
>>> Since all the tracking is performed on DMA map/unmap, no additional code
>>> is needed in the fast path, meaning the performance overhead of this
>>> tracking is negligible. The extra memory needed to track the pages is
>>> neatly encapsulated inside xarray, which uses the 'struct xa_node'
>>> structure to track items. This structure is 576 bytes long, with slots
>>> for 64 items, meaning that a full node incurs only 9 bytes of overhead
>>> per slot it tracks (in practice, it probably won't be this efficient,
>>> but in any case it should be an acceptable overhead).
>>
>> Even if items are stored sequentially in an xa_node at first, is it
>> possible that there may be fragmentation in those xa_nodes when pages
>> are released and allocated many times during packet processing? If so,
>> is there any fragmentation info about those xa_nodes?
>
> Some (that's what I meant by "not as efficient"). AFAICT, xarray does
> do some rebalancing of the underlying radix tree, freeing nodes when
> they are no longer used. However, I am not too familiar with the xarray
> code, so I don't know exactly how efficient this is in practice.

I guess that is one of the disadvantages of using an advanced structure
like xarray :(

>