On 31 Jan 2025, at 0:50, Alistair Popple wrote:

> On Thu, Jan 30, 2025 at 10:58:22PM -0500, Zi Yan wrote:
>> On 30 Jan 2025, at 21:59, Alistair Popple wrote:
>>
>>> I have a few topics that I would like to discuss around ZONE_DEVICE pages
>>> and their current and future usage in the kernel. Generally these pages are
>>> used to represent various forms of device memory (PCIe BAR space, coherent
>>> accelerator memory, persistent memory, unaddressable device memory). All
>>> of these require special treatment by the core MM, so many features must be
>>> implemented specifically for ZONE_DEVICE pages.
>>>
>>> I would like to get feedback on several ideas I've had for a while:
>>>
>>> Large page migration for ZONE_DEVICE pages
>>> ==========================================
>>>
>>> Currently large ZONE_DEVICE pages only exist for persistent memory use cases
>>> (DAX, FS DAX). This involves a special reference counting scheme which I hope to
>>> have fixed[1] by the time of LSF/MM/BPF. Fixing this allows for other higher
>>> order ZONE_DEVICE folios.
>>>
>>> Specifically I would like to introduce the possibility of migrating large CPU
>>> folios to unaddressable (DEVICE_PRIVATE) or coherent (DEVICE_COHERENT) memory.
>>> The current interfaces (migrate_vma) don't allow that as they require all folios
>>> to be split.
>>>
>>> Some of the issues are:
>>>
>>> 1. What should the interface look like?
>>>
>>> These are non-LRU pages, so likely there is overlap with "non-lru page migration
>>> in a memdesc world"[2].
>>
>> It seems to me that unaddressable (DEVICE_PRIVATE) and coherent (DEVICE_COHERENT)
>> memory should be treated differently, since the CPU cannot access the former but
>> can access the latter. Am I getting it right?
>
> In some ways they are similar (they are non-LRU pages, core-MM doesn't in
> general touch them for e.g. reclaim, etc.) but as you say they are also different
> in that the latter can be accessed directly from the CPU.
>
> The key thing they have in common, though, is that they only get mapped into
> userspace via a device driver explicitly migrating them there, hence why I have
> included them here.
>
>>>
>>> 2. How do we allow merging/splitting of pages during migration?
>>>
>>> This is necessary because when migrating back from device memory there may not
>>> be enough large CPU pages available.
>>
>> It is similar to THP swap out and swap in: we swap out a whole THP but swap in
>> individual base pages. But there is a discussion on large folio swap-in[1] that
>> might change this.
>>
>> [1] https://lore.kernel.org/linux-mm/58716200-fd10-4487-aed3-607a10e9fdd0@xxxxxxxxx/
>>
>>>
>>> 3. Any other issues?
>>
>> Once a large folio is migrated to the device, when the CPU wants to access the
>> data, even if there is enough free CPU memory, we might not want to migrate the
>> entire large folio back, since maybe only a base page is shared between the CPU
>> and the device. Bouncing a large folio for data shared within a single base page
>> would be wasteful.
>
> Indeed. This bouncing normally happens via a migrate_to_ram() callback, so I was
> thinking this would be one instance where a driver might want to split a page
> when migrating back with e.g. migrate_vma_*().
>
>> I am thinking about doing something like a PCIe atomic from the device. Does
>> that make sense?
>
> I'm not sure I follow where exactly PCIe atomics fit in here? If a page has been
> migrated to a GPU we wouldn't need PCIe atomics. Or are you saying avoiding PCIe
> atomics might be another reason a page might need to be split? (i.e. the CPU is
> doing atomic access to one subpage and the GPU to another)

Oh, I got PCIe atomics wrong. I thought migration was needed even for PCIe
atomics. Forget about my comment on PCIe atomics.
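
On the migrate_to_ram() splitting point: just to make sure we are picturing the
same thing, below is a rough sketch of what a driver fault callback looks like
today with the single-base-page migrate_vma_*() flow (modelled loosely on
lib/test_hmm.c). The example_* names are made up, error handling is elided, and
the copy helper stands in for whatever DMA mechanism the driver uses. As I
understand your proposal, a large-folio-aware variant would let the driver
decide at this point whether to split and migrate back only the faulting
subpage.

#include <linux/migrate.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

static void *example_owner;     /* whatever was set as dev_pagemap.owner */

/* Made-up helper: DMA the data out of device memory into @dpage. */
static void example_copy_from_device(struct page *dpage, struct page *spage)
{
        /* driver-specific DMA copy elided */
}

static vm_fault_t example_migrate_to_ram(struct vm_fault *vmf)
{
        unsigned long src = 0, dst = 0;
        struct migrate_vma args = {
                .vma = vmf->vma,
                .start = vmf->address,
                .end = vmf->address + PAGE_SIZE,
                .src = &src,
                .dst = &dst,
                .pgmap_owner = example_owner,
                .flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
                .fault_page = vmf->page,
        };
        struct page *dpage;

        /* Collect and isolate the single faulting device-private page. */
        if (migrate_vma_setup(&args))
                return VM_FAULT_SIGBUS;

        if (src & MIGRATE_PFN_MIGRATE) {
                /* A real driver would use a NUMA/vma-aware allocation here. */
                dpage = alloc_page(GFP_HIGHUSER_MOVABLE);
                if (dpage) {
                        lock_page(dpage);
                        example_copy_from_device(dpage, vmf->page);
                        dst = migrate_pfn(page_to_pfn(dpage));
                }
        }

        /* Install the system page and drop the device-private entry. */
        migrate_vma_pages(&args);
        migrate_vma_finalize(&args);
        return 0;
}

With a large device folio the interesting question is what happens just before
the allocation above: does the core code hand the driver the whole folio and
let it decide whether to split, or does the split happen transparently inside
migrate_vma_setup()?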

>
>>>
>>> [1] - https://lore.kernel.org/linux-mm/cover.11189864684e31260d1408779fac9db80122047b.1736488799.git-series.apopple@xxxxxxxxxx/
>>> [2] - https://lore.kernel.org/linux-mm/2612ac8a-d0a9-452b-a53d-75ffc6166224@xxxxxxxxxx/
>>>
>>> File-backed DEVICE_PRIVATE/COHERENT pages
>>> =========================================
>>>
>>> Currently DEVICE_PRIVATE and DEVICE_COHERENT pages are only supported for
>>> private anonymous memory. This prevents devices from having local access to
>>> shared or file-backed mappings, instead relying on remote DMA access, which
>>> limits performance.
>>>
>>> I have been prototyping allowing ZONE_DEVICE pages in the page cache with
>>> a callback when the CPU requires access. This approach seems promising and
>>> relatively straightforward but I would like some early feedback on either this
>>> or alternate approaches that I should investigate.
>>>
>>> Combining P2PDMA and DEVICE_PRIVATE pages
>>> =========================================
>>>
>>> Currently device memory that cannot be directly accessed via the CPU can be
>>> represented by DEVICE_PRIVATE pages, allowing it to be mapped and treated like
>>> a normal virtual page by userspace. Many devices also support accessing device
>>> memory directly from the CPU via a PCIe BAR.
>>>
>>> This access requires a P2PDMA page, meaning there are potentially two pages
>>> tracking the same piece of physical memory. This not only seems wasteful but
>>> fraught - for example device drivers need to keep page lifetimes in sync. I
>>> would like to discuss ways of solving this.
>>>
>>> DEVICE_PRIVATE pages, the linear map and the memdesc world
>>> ==========================================================
>>>
>>> DEVICE_PRIVATE pages currently reside in the linear map such that pfn_to_page()
>>> and page_to_pfn() work "as expected". However this implies a contiguous range
>>> of unused physical addresses needs to be both available and allocated for device
>>> memory. This isn't always available, particularly on ARM[1] where the vmemmap
>>> region may not be large enough to accommodate the amount of device memory.
>>>
>>> However it occurs to me that (almost?) all code paths that deal with
>>> DEVICE_PRIVATE pages are already aware of this - in the case of page_to_pfn()
>>> the page can be directly queried with is_device_private_page() and in the case
>>> of pfn_to_page() the pfn has (almost?) always been obtained from a special swap
>>> entry indicating such.
>>>
>>> So does page_to_pfn()/pfn_to_page() really need to work for DEVICE_PRIVATE
>>> pages? If not, could we allocate the struct pages in a vmalloc array instead? Do
>>> we even need ZONE_DEVICE pages/folios in a memdesc world?
>>
>> It occurred to me as well when I was reading your migration proposal above:
>> struct page is not used for DEVICE_PRIVATE, so maybe it is OK to get rid of it.
>> How about DEVICE_COHERENT? Is its struct page used currently? I see the AMD kfd
>> driver is using DEVICE_COHERENT (Christian König cc'd).
>
> I'm not sure removing struct page for DEVICE_COHERENT would be so
> straightforward. Unlike DEVICE_PRIVATE pages these are mapped by normal present
> PTEs, so we can't rely on having a special PTE to figure out which variant of
> pfn_to_{page|memdesc|thing}() to call.
>
> On the other hand this is real memory in the physical address space, and so
> should probably be covered by the linear map anyway and have its own reserved
> region of physical address space. This is unlike DEVICE_PRIVATE entries which
> effectively need to steal some physical address space.

Got it. Like you said above, DEVICE_PRIVATE and DEVICE_COHERENT are both non-LRU
pages, but only DEVICE_COHERENT can be accessed by the CPU. We probably want to
categorize them differently, based on DavidH's email[1]:

DEVICE_PRIVATE: non-folio migration
DEVICE_COHERENT: non-LRU folio migration

[1] https://lore.kernel.org/linux-mm/bb0f813e-7c1b-4257-baa5-5afe18be8552@xxxxxxxxxx/
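
To check that I have the present vs. non-present distinction right, this is
roughly how I picture the two lookup paths. It is purely illustrative (the
helper below is made up, not existing kernel code), but it shows why only
DEVICE_PRIVATE could plausibly drop pfn_to_page(): the special swap entry
identifies the page type before any pfn conversion, whereas a DEVICE_COHERENT
page is only identifiable after the pfn has already been turned into a
struct page.

#include <linux/mm.h>
#include <linux/memremap.h>
#include <linux/swapops.h>

static struct page *zone_device_page_from_pte(struct vm_area_struct *vma,
                                              unsigned long addr, pte_t pte)
{
        struct page *page;

        if (!pte_present(pte)) {
                swp_entry_t entry = pte_to_swp_entry(pte);

                /*
                 * DEVICE_PRIVATE: the non-present entry itself says what
                 * the pfn refers to, so a driver-specific lookup could be
                 * used here instead of the generic pfn_to_page().
                 */
                if (is_device_private_entry(entry))
                        return pfn_swap_entry_to_page(entry);
                return NULL;
        }

        /*
         * DEVICE_COHERENT: an ordinary present PTE, so the type is only
         * known once the pfn has been converted to a struct page.
         */
        page = vm_normal_page(vma, addr, pte);
        if (page && is_device_coherent_page(page))
                return page;
        return NULL;
}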

Best Regards,
Yan, Zi