ID stashing seems to cause page->lru.next (aliased to page->pp_magic) to
be in the mmap'able space for some arches.

>
>> It seems it is not really only about kernel pointers, as round_hint_to_min()
>> in mm/mmap.c suggests and the commit log of the above commit 8a5e5e02fc83
>> indicates, if I understand it correctly:
>> "Given unprivileged users cannot mmap anything below mmap_min_addr, it
>> should be safe to use poison pointers lower than mmap_min_addr."
>>
>> And the above "making sure the top-most bit is always 0" doesn't seem to
>> ensure page->signature is outside the range used for kernel pointers
>> on 32-bit arches with VMSPLIT_1G defined, see arch/arm/Kconfig; there
>> is a similar config for x86 too: ...
>
> Ah, interesting, didn't know this was configurable. Okay, so AFAICT, the
> lowest value of PAGE_OFFSET is 0x40000000 (for VMSPLIT_1G), so we need
> to leave two bits off at the top instead of just one. Will update this,
> and try to explain the logic better in the comment.

It seems there was an attempt at doing a 4G/4G split too; that is the kind
of limitation or complexity added to the ARCH and MM subsystems by doing
the ID stashing I mentioned earlier:

https://lore.kernel.org/lkml/Pine.LNX.4.44.0307082332450.17252-100000@localhost.localdomain/

>
>> IMHO, even if some trick like the above is really feasible, it may be
>> adding some limitation or complexity to the ARCH and MM subsystems; for
>> example, stashing the ID in page->signature may cause the 0x*40 signature
>> to be unusable for other poison pointer purposes. It makes more sense to
>> make this obvious by doing the above trick in some MM header file like
>> poison.h instead of in the page_pool subsystem.
>
> AFAIU, PP_SIGNATURE is used by page_pool to be able to distinguish its
> own pages from those allocated elsewhere (cf the description above).
> Which means that these definitions are logically page_pool-internal, and
> thus it makes the most sense to keep them in the page pool headers.
> The only bits the mm subsystem cares about in that field are the bottom
> two (for pfmemalloc pages and compound pages).

All I asked was to move the PP_MAGIC_MASK macro into poison.h, if you
still want to proceed with reusing page->pp_magic, as it seems reasonable
for the mask and the signature being masked to live in the same file.

>
>>>>> Since all the tracking is performed on DMA map/unmap, no additional code
>>>>> is needed in the fast path, meaning the performance overhead of this
>>>>> tracking is negligible. A micro-benchmark shows that the total overhead
>>>>> of using xarray for this purpose is about 400 ns (39 cycles(tsc) 395.218
>>>>> ns; sum for both map and unmap[1]). Since this cost is only paid on DMA
>>>>> map and unmap, it seems like an acceptable cost to fix the late unmap
>>>>
>>>> For most use cases when PP_FLAG_DMA_MAP is set and the IOMMU is off, the
>>>> DMA map and unmap operations are almost negligible as said below, so the
>>>> cost is about a 200% performance degradation, which doesn't seem like an
>>>> acceptable cost.
>>>
>>> I disagree. This only impacts the slow path; as long as pages are
>>> recycled there is no additional cost. While your patch series has
>>> demonstrated that it is *possible* to reduce the cost even in the slow
>>> path, I don't think the complexity cost of this is worth it.
>>
>> It is still the datapath, otherwise there wouldn't be specific testing
>> for that use case; more than 200% performance degradation is too much
>> IMHO.
>
> Do you have a real-world use case (i.e., a networking benchmark, not a
> micro-benchmark of the allocation function) where this change has a
> measurable impact on performance? If so, please do share the numbers!

I don't have one yet.
>
> I very much doubt it will be anything close to that magnitude, but I'm
> always willing to be persuaded by data :)
>
>> Leaving aside the above severe performance degradation, reusing space in
>> page->signature seems to be a tradeoff between adding complexity in the
>> page_pool subsystem and in the VM/ARCH subsystems, as mentioned above.
>
> I think you are overstating the impact on other MM users; this is all
> mostly page_pool-internal logic, cf the explanation above.

To be honest, I am not that familiar with the pointer poisoning mechanism.
But through some research and analysis, it seems worth overstating it a
little, as it appears to be security-related. Cc'ed some security experts
and MLs to see if they can offer some clarification.

>
>>>
>>> [...]
>>>
>>>>> The extra memory needed to track the pages is neatly encapsulated inside
>>>>> xarray, which uses the 'struct xa_node' structure to track items. This
>>>>> structure is 576 bytes long, with slots for 64 items, meaning that a
>>>>> full node incurs only 9 bytes of overhead per slot it tracks (in
>>>>> practice, it probably won't be this efficient, but in any case it should
>>>>
>>>> Is there any debug infrastructure to know if it is not this efficient?
>>>> There may be 576 bytes of overhead for a page in the worst case.
>>>
>>> There's an XA_DEBUG define which enables some dump functions, but I
>>> don't think there's any API to inspect the memory usage. I guess you
>>> could attach a BPF program and walk the structure, or something?
>>>
>>
>> It seems XA_DEBUG is not defined in production environments.
>> And I am not familiar enough with BPF to know whether the BPF
>> approach is feasible in a production environment.
>> Any example of the above BPF program and how to attach it?
>
> Hmm, no, not really, sorry :(
>
> I *think* it should be possible to write a bpftrace script that walks
> the internal xarray tree and counts the nodes along the way, but it's
> not trivial to do, and I haven't tried.
>
> -Toke