Re: [LSF/MM/BPF TOPIC] The future of ZONE_DEVICE pages

On 30 Jan 2025, at 21:59, Alistair Popple wrote:

> I have a few topics that I would like to discuss around ZONE_DEVICE pages
> and their current and future usage in the kernel. Generally these pages are
> used to represent various forms of device memory (PCIe BAR space, coherent
> accelerator memory, persistent memory, unaddressable device memory). All
> of these require special treatment by the core MM so many features must be
> implemented specifically for ZONE_DEVICE pages.
>
> I would like to get feedback on several ideas I've had for a while:
>
> Large page migration for ZONE_DEVICE pages
> ==========================================
>
> Currently large ZONE_DEVICE pages only exist for persistent memory use cases
> (DAX, FS DAX). This involves a special reference counting scheme which I hope to
> have fixed[1] by the time of the LSF/MM/BPF. Fixing this allows for other higher
> order ZONE_DEVICE folios.
>
> Specifically I would like to introduce the possibility of migrating large CPU
> folios to unaddressable (DEVICE_PRIVATE) or coherent (DEVICE_COHERENT) memory.
> The current interfaces (migrate_vma) don't allow that as they require all folios
> to be split.
>
> Some of the issues are:
>
> 1. What should the interface look like?
>
> These are non-lru pages, so likely there is overlap with "non-lru page migration
> in a memdesc world"[2]

It seems to me that unaddressable (DEVICE_PRIVATE) and coherent (DEVICE_COHERENT)
pages should be treated differently, since the CPU cannot access the former but can
access the latter. Am I getting that right?
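
To make the interface question a bit more concrete, below is roughly how a driver
drives migrate_vma today, with a purely hypothetical MIGRATE_PFN_COMPOUND-style
flag sketched in for the large-folio case (that flag and its semantics are an
assumption, not an existing interface; vma, start, src_pfns, dst_pfns and drvdata
are assumed driver-local variables):

	struct migrate_vma args = {
		.vma		= vma,
		.start		= start,		/* PMD-aligned */
		.end		= start + HPAGE_PMD_SIZE,
		.src		= src_pfns,
		.dst		= dst_pfns,
		.pgmap_owner	= drvdata,
		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
	};

	if (migrate_vma_setup(&args))
		return -EINVAL;

	/*
	 * Today migrate_vma_setup() splits any large folio, so src_pfns
	 * only ever holds base-page entries.  A folio-aware interface
	 * could instead report one entry per folio, e.g.:
	 *
	 *	src_pfns[0] |= MIGRATE_PFN_COMPOUND;	(hypothetical)
	 *
	 * and the driver would then allocate a single large device folio
	 * for the destination.
	 */
	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);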

>
> 2. How do we allow merging/splitting of pages during migration?
>
> This is necessary because when migrating back from device memory there may not
> be enough large CPU pages available.

This is similar to THP swap-out and swap-in: we swap out a whole THP but swap in
individual base pages. There is an ongoing discussion on large folio swap-in[1]
that might change this, though.

[1] https://lore.kernel.org/linux-mm/58716200-fd10-4487-aed3-607a10e9fdd0@xxxxxxxxx/
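
In driver terms the fallback could look something like the sketch below when
migrating back to the CPU (split_device_folio() is a made-up placeholder for
however the device-side folio would actually be split; src and dst are assumed
driver-local folios):

	/* Prefer a large destination folio, degrade to base pages if needed. */
	dst = folio_alloc(GFP_HIGHUSER_MOVABLE, folio_order(src));
	if (!dst) {
		if (split_device_folio(src))		/* placeholder */
			return -ENOMEM;
		/* fall back to migrating each base page individually, as today */
	}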

>
> 3. Any other issues?

Once a large folio has been migrated to the device and the CPU later wants to
access the data, we might not want to migrate the entire large folio back even if
there is enough CPU memory, since perhaps only a single base page is shared between
the CPU and the device. Bouncing a large folio for data shared within one base page
would be wasteful. I am also thinking about doing something like PCIe atomics from
the device. Does that make sense?
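
To illustrate the partial migrate-back idea (setting the PCIe atomics question
aside), the device-private fault path could take a shape like the sketch below;
my_split_device_folio() and my_migrate_one_page_to_ram() are placeholders, not
existing helpers:

	static vm_fault_t my_migrate_to_ram(struct vm_fault *vmf)
	{
		struct folio *folio = page_folio(vmf->page);

		/* Split the large device folio so a single base page can move. */
		if (folio_order(folio) && my_split_device_folio(folio))
			return VM_FAULT_SIGBUS;

		/* Migrate only the base page covering vmf->address back to the CPU. */
		return my_migrate_one_page_to_ram(vmf);
	}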

>
> [1] - https://lore.kernel.org/linux-mm/cover.11189864684e31260d1408779fac9db80122047b.1736488799.git-series.apopple@xxxxxxxxxx/
> [2] - https://lore.kernel.org/linux-mm/2612ac8a-d0a9-452b-a53d-75ffc6166224@xxxxxxxxxx/
>
> File-backed DEVICE_PRIVATE/COHERENT pages
> =========================================
>
> Currently DEVICE_PRIVATE and DEVICE_COHERENT pages are only supported for
> private anonymous memory. This prevents devices from having local access to
> shared or file-backed mappings, forcing them to rely on remote DMA access, which
> limits performance.
>
> I have been prototyping allowing ZONE_DEVICE pages in the page cache with
> a callback when the CPU requires access. This approach seems promising and
> relatively straightforward, but I would like some early feedback on either this
> or alternate approaches that I should investigate.
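
Purely as a speculative illustration of what such a "CPU requires access"
callback might look like (none of this exists today and the prototype may be
shaped completely differently), it could hang off dev_pagemap_ops next to the
existing hooks; folio_cpu_access() is a made-up name:

	struct dev_pagemap_ops {
		/* ... existing ops such as page_free() and migrate_to_ram() ... */

		/* speculative: called before the CPU touches a file-backed device folio */
		int (*folio_cpu_access)(struct folio *folio,
					struct address_space *mapping);
	};
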
>
> Combining P2PDMA and DEVICE_PRIVATE pages
> =========================================
>
> Currently device memory that cannot be directly accessed via the CPU can be
> represented by DEVICE_PRIVATE pages, allowing it to be mapped and treated like
> a normal virtual page by userspace. Many devices also support accessing device
> memory directly from the CPU via a PCIe BAR.
>
> This access requires a P2PDMA page, meaning there are potentially two pages
> tracking the same piece of physical memory. This not only seems wasteful but
> fraught - for example device drivers need to keep page lifetimes in sync. I
> would like to discuss ways of solving this.
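
As a rough illustration of the duplication, a driver today ends up with something
like the structure below for every chunk of BAR memory (struct and field names are
made up), and has to keep the two struct pages' lifetimes in sync by hand:

	struct my_bar_chunk {
		struct page *private_page;	/* DEVICE_PRIVATE view, not CPU addressable */
		struct page *p2pdma_page;	/* P2PDMA view of the same BAR offset */
		/* the refcounts/lifetimes of the two pages must be kept in sync manually */
	};
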
>
> DEVICE_PRIVATE pages, the linear map and the memdesc world
> ==========================================================
>
> DEVICE_PRIVATE pages currently reside in the linear map such that pfn_to_page()
> and page_to_pfn() work "as expected". However this implies a contiguous range
> of unused physical addresses need to be both available and allocated for device
> memory. This isn't always available, particularly on ARM[1] where the vmemmap
> region may not be large enough to accommodate the amount of device memory.
>
> However it occurs to me that (almost?) all code paths that deal with
> DEVICE_PRIVATE pages are already aware of this - in the case of page_to_pfn()
> the page can be directly queried with is_device_private_page() and in the case
> of pfn_to_page() the pfn has (almost?) always been obtained from a special swap
> entry indicating such.
>
> So does page_to_pfn()/pfn_to_page() really need to work for DEVICE_PRIVATE
> pages? If not could we allocate the struct pages in a vmalloc array instead? Do
> we even need ZONE_DEVICE pages/folios in a memdesc world?

It occurred to me as well while reading your migration proposal above: if struct
page is not really used for DEVICE_PRIVATE, maybe it is OK to get rid of it. How
about DEVICE_COHERENT? Is its struct page used currently? I see that the AMD kfd
driver is using DEVICE_COHERENT (Christian König cc'd).
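
For reference, the way core MM typically reaches a DEVICE_PRIVATE page today is
via the special swap entry rather than a raw pfn walk, roughly paraphrasing
do_swap_page() (locking and error handling omitted):

	swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);

	if (is_device_private_entry(entry)) {
		vmf->page = pfn_swap_entry_to_page(entry);
		ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
	}

If that really is (almost?) the only way these pages get looked up, then the
struct pages could presumably live in a vmalloc'ed array as long as
pfn_swap_entry_to_page() knows how to find them there.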



--
Best Regards,
Yan, Zi




