Re: [PATCH net-next 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool

Yunsheng Lin <linyunsheng@xxxxxxxxxx> writes:

> On 2025/3/17 23:16, Toke Høiland-Jørgensen wrote:
>> Yunsheng Lin <yunshenglin0825@xxxxxxxxx> writes:
>> 
>>> On 3/14/2025 6:10 PM, Toke Høiland-Jørgensen wrote:
>>>
>>> ...
>>>
>>>>
>>>> To avoid having to walk the entire xarray on unmap to find the page
>>>> reference, we stash the ID assigned by xa_alloc() into the page
>>>> structure itself, using the upper bits of the pp_magic field. This
>>>> requires a couple of defines to avoid conflicting with the
>>>> POINTER_POISON_DELTA define, but this is all evaluated at compile-time,
>>>> so does not affect run-time performance. The bitmap calculations in this
>>>> patch give the following number of bits for different architectures:
>>>>
>>>> - 24 bits on 32-bit architectures
>>>> - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
>>>> - 32 bits on other 64-bit architectures
>>>
>>>  From commit c07aea3ef4d4 ("mm: add a signature in struct page"):
>>> "The page->signature field is aliased to page->lru.next and
>>> page->compound_head, but it can't be set by mistake because the
>>> signature value is a bad pointer, and can't trigger a false positive
>>> in PageTail() because the last bit is 0."
>>>
>>> And commit 8a5e5e02fc83 ("include/linux/poison.h: fix LIST_POISON{1,2} 
>>> offset"):
>>> "Poison pointer values should be small enough to find a room in
>>> non-mmap'able/hardly-mmap'able space."
>>>
>>> So the question seems to be:
>>> 1. Does stashing the ID cause page->pp_magic to land in the mmap'able/
>>>     more easily mmap'able space? If yes, how can we make sure this will
>>>     not cause any security problem?
>>> 2. Does masking page->pp_magic cause a valid pointer in
>>>     page->lru.next or page->compound_head to be treated as a valid
>>>     PP_SIGNATURE? That might cause page_pool to recycle a page not
>>>     allocated via page_pool.
>> 
>> Right, so my reasoning for why the defines in this patch work for this
>> is as follows: in both cases we need to make sure that the ID stashed in
>> that field never looks like a valid kernel pointer. For 64-bit arches
>> (where CONFIG_ILLEGAL_POINTER_VALUE is defined), we make sure of this by
>> never writing to any bits that overlap with the illegal value (so that
>> the PP_SIGNATURE written to the field keeps it as an illegal pointer
>> value).
>> For 32-bit arches, we make sure of this by making sure the top-most bit
>> is always 0 (the -1 in the define for _PP_DMA_INDEX_BITS) in the patch,
>> which puts it outside the range used for kernel pointers (AFAICT).
>
> Is there any reason you think only kernel pointers are relevant here?

Yes. Any pointer stored in the same space as pp_magic by other users of
the page will be a kernel pointer (as it comes from page->lru.next). The
goal of PP_SIGNATURE is to be able to distinguish pages allocated by
page_pool, so we don't accidentally recycle a page from somewhere else.
That's the goal of the check in page_pool_page_is_pp():

(page->pp_magic & PP_MAGIC_MASK) == PP_SIGNATURE

To achieve this, we must ensure that the check above never returns true
for any value another page user could have written into the same field
(i.e., into page->lru.next). For 64-bit arches, POISON_POINTER_DELTA
serves this purpose. For 32-bit arches, we can leave the top-most bits
out of PP_MAGIC_MASK, to make sure that any valid pointer value will
fail the check above.
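
To make this concrete, here is a minimal sketch of the kind of layout
being described. The names and values below are illustrative only (chosen
to echo the patch's naming), not the literal defines from the patch:

/*
 * Illustrative sketch, not the patch itself. The ID returned by
 * xa_alloc() is stored in pp_magic just above the signature bits;
 * PP_MAGIC_MASK excludes those ID bits (and the low two bits, which are
 * kept for compound-head and pfmemalloc tracking), so the ownership
 * check is unaffected by whatever ID is stored there.
 */
#define PP_SIGNATURE		(0x40 + POISON_POINTER_DELTA)
#define PP_DMA_INDEX_SHIFT	7	/* first bit above the 0x40 signature */
#if BITS_PER_LONG == 32
/* leave the top bit clear so the value can never look like a kernel pointer */
#define PP_DMA_INDEX_BITS	(BITS_PER_LONG - PP_DMA_INDEX_SHIFT - 1)
#else
/* on 64-bit, stop below the lowest bit set in POISON_POINTER_DELTA */
#define PP_DMA_INDEX_BITS	32
#endif
#define PP_DMA_INDEX_MASK	GENMASK(PP_DMA_INDEX_BITS + PP_DMA_INDEX_SHIFT - 1, \
					PP_DMA_INDEX_SHIFT)
#define PP_MAGIC_MASK		(~(PP_DMA_INDEX_MASK | 0x3UL))

static inline bool page_pool_page_is_pp(struct page *page)
{
	return (page->pp_magic & PP_MAGIC_MASK) == PP_SIGNATURE;
}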

> It seems it is not really only about kernel pointers, as round_hint_to_min()
> in mm/mmap.c and the commit log of the above commit 8a5e5e02fc83 suggest,
> if I understand it correctly:
> "Given unprivileged users cannot mmap anything below mmap_min_addr, it
> should be safe to use poison pointers lower than mmap_min_addr."
>
> And the above "making sure the top-most bit is always 0" doesn't seem to
> ensure page->signature is outside the range used for kernel pointers
> for 32-bit arches with VMSPLIT_1G defined, see arch/arm/Kconfig (there
> is a similar config for x86 too):
>        prompt "Memory split"
>        depends on MMU
>        default VMSPLIT_3G
>        help
>          Select the desired split between kernel and user memory.
>
>          If you are not absolutely sure what you are doing, leave this
>          option alone!
>
>        config VMSPLIT_3G
>               bool "3G/1G user/kernel split"
>        config VMSPLIT_3G_OPT
>              depends on !ARM_LPAE
>               bool "3G/1G user/kernel split (for full 1G low memory)"
>        config VMSPLIT_2G
>               bool "2G/2G user/kernel split"
>        config VMSPLIT_1G
>               bool "1G/3G user/kernel split"

Ah, interesting, didn't know this was configurable. Okay, so AFAICT, the
lowest value of PAGE_OFFSET is 0x40000000 (for VMSPLIT_1G), so we need
to leave two bits off at the top instead of just one. Will update this,
and try to explain the logic better in the comment.
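
To spell the constraint out, something like the following compile-time
check captures it (purely illustrative, reusing the names from the sketch
earlier in the thread rather than the patch, and assuming a 32-bit build):

#if BITS_PER_LONG == 32
/*
 * Illustrative only: everything that can end up in pp_magic (signature
 * plus stashed ID bits plus the two low bits) must stay below the lowest
 * possible PAGE_OFFSET, which is 0x40000000 with CONFIG_VMSPLIT_1G, so
 * it can never alias a kernel pointer stored in page->lru.next. With
 * only one top bit reserved this assertion trips, which is exactly the
 * VMSPLIT_1G problem above; shaving one more bit off PP_DMA_INDEX_BITS
 * makes it hold.
 */
static_assert((PP_SIGNATURE | PP_DMA_INDEX_MASK | 0x3UL) < 0x40000000UL,
	      "pp_magic bits must stay below the lowest PAGE_OFFSET");
#endif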

> IMHO, even if some trick like the above is really feasible, it may be
> adding some limitation or complexity to the ARCH and MM subsystems. For
> example, stashing the ID in page->signature may cause the 0x*40 signature
> to be unusable for other poison pointer purposes. It makes more sense to
> make it obvious by doing the above trick in some MM header file like
> poison.h instead of in the page_pool subsystem.

AFAIU, PP_SIGNATURE is used for page_pool to be able to distinguish its
own pages from those allocated elsewhere (cf. the description above),
which means that these definitions are logically page_pool-internal, and
thus it makes the most sense to keep them in the page pool headers. The
only bits the mm subsystem cares about in that field are the bottom two
(for pfmemalloc pages and compound pages).

>>>> Since all the tracking is performed on DMA map/unmap, no additional code
>>>> is needed in the fast path, meaning the performance overhead of this
>>>> tracking is negligible. A micro-benchmark shows that the total overhead
>>>> of using xarray for this purpose is about 400 ns (39 cycles(tsc) 395.218
>>>> ns; sum for both map and unmap[1]). Since this cost is only paid on DMA
>>>> map and unmap, it seems like an acceptable cost to fix the late unmap
>>>
>>> For most use cases when PP_FLAG_DMA_MAP is set and IOMMU is off, the
>>> DMA map and unmap operation is almost negligible as said below, so the
>>> cost is about 200% performance degradation, which doesn't seem like an
>>> acceptable cost.
>> 
>> I disagree. This only impacts the slow path; as long as pages are
>> recycled there is no additional cost. While your patch series has
>> demonstrated that it is *possible* to reduce the cost even in the slow
>> path, I don't think the complexity cost of this is worth it.
>
> It is still the datapath, otherwise there wouldn't be specific testing
> for that use case; more than 200% performance degradation is too much
> IMHO.

Do you have a real-world use case (i.e., a networking benchmark, not a
micro-benchmark of the allocation function) where this change has a
measurable impact on performance? If so, please do share the numbers!

I very much doubt it will be anything close to that magnitude, but I'm
always willing to be persuaded by data :)

> Leaving aside the above severe performance degradation, reusing space in
> page->signature seems to be a tradeoff between adding complexity in the
> page_pool subsystem and in the VM/ARCH subsystems, as mentioned above.

I think you are overstating the impact on other MM users; this is all
mostly page_pool-internal logic, cf. the explanation above.

>> 
>> [...]
>> 
>>>> The extra memory needed to track the pages is neatly encapsulated inside
>>>> xarray, which uses the 'struct xa_node' structure to track items. This
>>>> structure is 576 bytes long, with slots for 64 items, meaning that a
>>>> full node incurs only 9 bytes of overhead per slot it tracks (in
>>>> practice, it probably won't be this efficient, but in any case it should
>>>
>>> Is there any debug infrastructure to know if it is not this efficient?
>>> There may be 576 bytes of overhead for a page in the worst case.
>> 
>> There's an XA_DEBUG define which enables some dump functions, but I
>> don't think there's any API to inspect the memory usage. I guess you
>> could attach a BPF program and walk the structure, or something?
>> 
>
> It seems XA_DEBUG is not defined in production environments.
> And I am not familiar enough with BPF programs to understand whether the
> BPF way is feasible in a production environment.
> Any example of the above BPF program and how to attach it?

Hmm, no, not really, sorry :(

I *think* it should be possible to write a bpftrace script that walks
the internal xarray tree and counts the nodes along the way, but it's
not trivial to do, and I haven't tried.
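
If a rough number is enough, one possible (completely untested) starting
point would be to count allocations from the radix_tree_node slab cache,
which backs the xarray nodes, rather than walking the tree. Something like
the below; note that the probed symbol may be kmem_cache_alloc_noprof on
newer kernels, and the struct access needs BTF (or kernel headers):

bpftrace -e '
kprobe:kmem_cache_alloc
{
	$s = (struct kmem_cache *)arg0;
	if (str($s->name) == "radix_tree_node") {
		/* key by a short stack so the page_pool call path stands out */
		@xa_node_allocs[kstack(5)] = count();
	}
}'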

-Toke