Re: [PATCH net-next 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool

On 2025/3/17 23:16, Toke Høiland-Jørgensen wrote:
> Yunsheng Lin <yunshenglin0825@xxxxxxxxx> writes:
> 
>> On 3/14/2025 6:10 PM, Toke Høiland-Jørgensen wrote:
>>
>> ...
>>
>>>
>>> To avoid having to walk the entire xarray on unmap to find the page
>>> reference, we stash the ID assigned by xa_alloc() into the page
>>> structure itself, using the upper bits of the pp_magic field. This
>>> requires a couple of defines to avoid conflicting with the
>>> POISON_POINTER_DELTA define, but this is all evaluated at compile-time,
>>> so does not affect run-time performance. The bitmap calculations in this
>>> patch give the following number of bits for different architectures:
>>>
>>> - 24 bits on 32-bit architectures
>>> - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
>>> - 32 bits on other 64-bit architectures
>>
>>  From commit c07aea3ef4d4 ("mm: add a signature in struct page"):
>> "The page->signature field is aliased to page->lru.next and
>> page->compound_head, but it can't be set by mistake because the
>> signature value is a bad pointer, and can't trigger a false positive
>> in PageTail() because the last bit is 0."
>>
>> And commit 8a5e5e02fc83 ("include/linux/poison.h: fix LIST_POISON{1,2} 
>> offset"):
>> "Poison pointer values should be small enough to find a room in
>> non-mmap'able/hardly-mmap'able space."
>>
>> So the question seems to be:
>> 1. Does stashing the ID cause page->pp_magic to be in the mmap'able/
>>     easily-mmap'able space? If yes, how can we make sure this will not
>>     cause any security problem?
>> 2. Does masking page->pp_magic cause a valid pointer for
>>     page->lru.next or page->compound_head to be treated as a valid
>>     PP_SIGNATURE, which might cause page_pool to recycle a page not
>>     allocated via page_pool?
> 
> Right, so my reasoning for why the defines in this patch work for this
> is as follows: in both cases we need to make sure that the ID stashed in
> that field never looks like a valid kernel pointer. For 64-bit arches
> (where CONFIG_ILLEGAL_POINTER_VALUE is defined), we ensure this by never
> writing to any bits that overlap with the illegal value (so that the
> PP_SIGNATURE written to the field keeps it an illegal pointer value).
> For 32-bit arches, we ensure this by keeping the top-most bit always 0
> (the -1 in the define for _PP_DMA_INDEX_BITS in the patch), which puts
> it outside the range used for kernel pointers (AFAICT).
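
FWIW, here is a quick userspace model of that bit budget. The shift
(index starting just above the 0x40 signature bit) and the cap at 32
bits are my reading of the patch, not its exact defines, but it does
reproduce the three counts quoted from the commit message:

/* bit-budget.c: rough model, not the patch's actual macros */
#include <stdio.h>
#include <stdint.h>

#define PP_DMA_INDEX_SHIFT 7    /* first bit above the 0x40 signature */

static int dma_index_bits(uint64_t illegal_ptr_val, int bits_per_long)
{
        int top;

        if (illegal_ptr_val)    /* 64-bit: stay below the poison bits */
                top = __builtin_ctzll(illegal_ptr_val);
        else                    /* 32-bit: keep the top-most bit zero */
                top = bits_per_long - 1;

        top -= PP_DMA_INDEX_SHIFT;
        return top > 32 ? 32 : top;
}

int main(void)
{
        printf("32-bit:       %d bits\n", dma_index_bits(0, 32));
        printf("PPC64:        %d bits\n",
               dma_index_bits(0x5deadbeef0000000ULL, 64));
        printf("other 64-bit: %d bits\n",
               dma_index_bits(0xdead000000000000ULL, 64));
        return 0;
}

This prints 24, 21 and 32 respectively, matching the commit message.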

Is there any reason you think only kernel pointers are relevant here?
It doesn't seem to be only about kernel pointers, as round_hint_to_min()
in mm/mmap.c suggests, and as the commit log of the above commit
8a5e5e02fc83 puts it, if I understand it correctly:
"Given unprivileged users cannot mmap anything below mmap_min_addr, it
should be safe to use poison pointers lower than mmap_min_addr."
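
For reference, round_hint_to_min() in mm/mmap.c (quoted from my local
tree, so double-check against net-next) rounds any mmap hint below
mmap_min_addr up to that floor:

static inline unsigned long round_hint_to_min(unsigned long hint)
{
        hint &= PAGE_MASK;
        if (((void *)hint != NULL) &&
            (hint < mmap_min_addr))
                return PAGE_ALIGN(mmap_min_addr);
        return hint;
}

So unprivileged mappings cannot land below mmap_min_addr, which is what
makes small poison values safe; a poison value above that range would
not get the same protection.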

And the above "making sure the top-most bit is always 0" doesn't seem to
ensure page->signature is outside the range used for kernel pointers on
32-bit arches with VMSPLIT_1G defined, see the excerpt from
arch/arm/Kconfig below (there is a similar config for x86 too) and the
sketch after it:
       prompt "Memory split"
       depends on MMU
       default VMSPLIT_3G
       help
         Select the desired split between kernel and user memory.

         If you are not absolutely sure what you are doing, leave this
         option alone!

       config VMSPLIT_3G
              bool "3G/1G user/kernel split"
       config VMSPLIT_3G_OPT
              depends on !ARM_LPAE
              bool "3G/1G user/kernel split (for full 1G low memory)"
       config VMSPLIT_2G
              bool "2G/2G user/kernel split"
       config VMSPLIT_1G
              bool "1G/3G user/kernel split"

IMHO, even if some trick like the above is really feasible, it adds
limitations or complexity to the ARCH and MM subsystems. For example,
stashing the ID in page->signature may make the 0x*40 signature
unusable for other poison pointer purposes. It would make more sense to
make this obvious by doing the above trick in some MM header file like
poison.h instead of inside the page_pool subsystem.
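
Something along the lines of the following is what I have in mind
(purely a hypothetical sketch; PP_SIGNATURE currently lives in
include/linux/mm.h):

/* include/linux/poison.h (hypothetical move): keep the page_pool
 * layout next to the other poison values so the reservation is
 * visible to anyone adding a new poison pointer. */
#ifdef CONFIG_ILLEGAL_POINTER_VALUE
# define POISON_POINTER_DELTA _AC(CONFIG_ILLEGAL_POINTER_VALUE, UL)
#else
# define POISON_POINTER_DELTA 0
#endif

/* 0x40 is reserved as the page_pool signature, and the bits above it
 * may hold page_pool's DMA index, so new poison values should avoid
 * the 0x*40 pattern. */
#define PP_SIGNATURE    (0x40 + POISON_POINTER_DELTA)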

> 
>>> Since all the tracking is performed on DMA map/unmap, no additional code
>>> is needed in the fast path, meaning the performance overhead of this
>>> tracking is negligible. A micro-benchmark shows that the total overhead
>>> of using xarray for this purpose is about 400 ns (39 cycles(tsc) 395.218
>>> ns; sum for both map and unmap[1]). Since this cost is only paid on DMA
>>> map and unmap, it seems like an acceptable cost to fix the late unmap
>>
>> For most use cases where PP_FLAG_DMA_MAP is set and the IOMMU is off,
>> the DMA map and unmap operations themselves are almost negligible as
>> mentioned below, so the cost amounts to about a 200% performance
>> degradation, which doesn't seem like an acceptable cost.
> 
> I disagree. This only impacts the slow path, as long as pages are
> recycled there is no additional cost. While your patch series has
> demonstrated that it is *possible* to reduce the cost even in the slow
> path, I don't think the complexity cost of this is worth it.

It is still the data path, otherwise there wouldn't be specific testing
for that use case; more than 200% performance degradation is too much
IMHO.

Leaving aside the above severe performance degradation, reusing space in
page->signature seems to be a tradeoff between adding complexity to the
page_pool subsystem and to the MM/ARCH subsystems, as mentioned above.

> 
> [...]
> 
>>> The extra memory needed to track the pages is neatly encapsulated inside
>>> xarray, which uses the 'struct xa_node' structure to track items. This
>>> structure is 576 bytes long, with slots for 64 items, meaning that a
>>> full node incurs only 9 bytes of overhead per slot it tracks (in
>>> practice, it probably won't be this efficient, but in any case it should
>>
>> Is there any debug infrastructure to know if it is not this efficient?
>> as there may be 576 byte overhead for a page for the worst case.
> 
> There's an XA_DEBUG define which enables some dump functions, but I
> don't think there's any API to inspect the memory usage. I guess you
> could attach a BPF program and walk the structure, or something?
> 

It seems XA_DEBUG is not defined in production environments, and I am
not familiar enough with BPF to know whether that approach is feasible
in a production environment. Any example of the above BPF program and
how to attach it?



