On 31 Jan 2025, at 0:50, Alistair Popple wrote:

> On Thu, Jan 30, 2025 at 10:58:22PM -0500, Zi Yan wrote:
>> On 30 Jan 2025, at 21:59, Alistair Popple wrote:
>>
>>> I have a few topics that I would like to discuss around ZONE_DEVICE pages
>>> and their current and future usage in the kernel. Generally these pages are
>>> used to represent various forms of device memory (PCIe BAR space, coherent
>>> accelerator memory, persistent memory, unaddressable device memory). All
>>> of these require special treatment by the core MM, so many features must be
>>> implemented specifically for ZONE_DEVICE pages.
>>>
>>> I would like to get feedback on several ideas I've had for a while:
>>>
>>> Large page migration for ZONE_DEVICE pages
>>> ==========================================
>>>
>>> Currently large ZONE_DEVICE pages only exist for persistent memory use cases
>>> (DAX, FS DAX). This involves a special reference counting scheme which I hope to
>>> have fixed[1] by the time of LSF/MM/BPF. Fixing this allows for other higher
>>> order ZONE_DEVICE folios.
>>>
>>> Specifically I would like to introduce the possibility of migrating large CPU
>>> folios to unaddressable (DEVICE_PRIVATE) or coherent (DEVICE_COHERENT) memory.
>>> The current interfaces (migrate_vma) don't allow that as they require all folios
>>> to be split.
>>>
>>> Some of the issues are:
>>>
>>> 1. What should the interface look like?
>>>
>>> These are non-LRU pages, so likely there is overlap with "non-lru page migration
>>> in a memdesc world"[2].
>>
>> It seems to me that unaddressable (DEVICE_PRIVATE) and coherent (DEVICE_COHERENT)
>> memory should be treated differently, since the CPU cannot access the former but
>> can access the latter. Am I getting it right?
>
> In some ways they are similar (they are non-LRU pages, core-MM doesn't in
> general touch them for e.g. reclaim, etc.) but as you say they are also different
> in that the latter can be accessed directly from the CPU.
>
> The key thing they have in common, though, is that they only get mapped into
> userspace via a device driver explicitly migrating them there, hence why I have
> included them here.
>
>>>
>>> 2. How do we allow merging/splitting of pages during migration?
>>>
>>> This is necessary because when migrating back from device memory there may not
>>> be enough large CPU pages available.
>>
>> It is similar to THP swap out and swap in: we swap out a whole THP but swap in
>> individual base pages. But there is a discussion on large folio swap-in[1] that
>> might change this.
>>
>> [1] https://lore.kernel.org/linux-mm/58716200-fd10-4487-aed3-607a10e9fdd0@xxxxxxxxx/
>>
>>>
>>> 3. Any other issues?
>>
>> Once a large folio is migrated to the device, when the CPU wants to access the
>> data, even if there is enough free CPU memory, we might not want to migrate the
>> entire large folio back, since maybe only a base page is shared between the CPU
>> and the device. Bouncing a large folio for data shared within a single base page
>> would be wasteful.
>
> Indeed. This bouncing normally happens via a migrate_to_ram() callback, so I was
> thinking this would be one instance where a driver might want to split a page
> when migrating back with e.g. migrate_vma_*().
>
>> I am thinking about doing something like a PCIe atomic from the device. Does
>> that make sense?
>
> I'm not sure I follow where exactly PCIe atomics fit in here? If a page has been
> migrated to a GPU we wouldn't need PCIe atomics. Or are you saying avoiding PCIe
> atomics might be another reason a page might need to be split? (i.e. the CPU is
> doing atomic access to one subpage and the GPU to another)

Oh, I got PCIe atomics wrong. I thought migration was needed even for PCIe
atomics. Forget about my comment on PCIe atomics.
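
On the migrate_to_ram() splitting point: just to make sure we are picturing the
same thing, below is a rough sketch of what a driver fault callback looks like
today with the single-base-page migrate_vma_*() flow (modelled loosely on
lib/test_hmm.c). The example_* names are made up, error handling is elided, and
the copy helper stands in for whatever DMA mechanism the driver uses. As I
understand your proposal, a large-folio-aware variant would let the driver
decide at this point whether to split and migrate back only the faulting
subpage.

#include <linux/migrate.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

static void *example_owner;     /* whatever was set as dev_pagemap.owner */

/* Made-up helper: DMA the data out of device memory into @dpage. */
static void example_copy_from_device(struct page *dpage, struct page *spage)
{
        /* driver-specific DMA copy elided */
}

static vm_fault_t example_migrate_to_ram(struct vm_fault *vmf)
{
        unsigned long src = 0, dst = 0;
        struct migrate_vma args = {
                .vma = vmf->vma,
                .start = vmf->address,
                .end = vmf->address + PAGE_SIZE,
                .src = &src,
                .dst = &dst,
                .pgmap_owner = example_owner,
                .flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
                .fault_page = vmf->page,
        };
        struct page *dpage;

        /* Collect and isolate the single faulting device-private page. */
        if (migrate_vma_setup(&args))
                return VM_FAULT_SIGBUS;

        if (src & MIGRATE_PFN_MIGRATE) {
                /* A real driver would use a NUMA/vma-aware allocation here. */
                dpage = alloc_page(GFP_HIGHUSER_MOVABLE);
                if (dpage) {
                        lock_page(dpage);
                        example_copy_from_device(dpage, vmf->page);
                        dst = migrate_pfn(page_to_pfn(dpage));
                }
        }

        /* Install the system page and drop the device-private entry. */
        migrate_vma_pages(&args);
        migrate_vma_finalize(&args);
        return 0;
}

With a large device folio the interesting question is what happens just before
the allocation above: does the core code hand the driver the whole folio and
let it decide whether to split, or does the split happen transparently inside
migrate_vma_setup()?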

>
>>>
>>> [1] - https://lore.kernel.org/linux-mm/cover.11189864684e31260d1408779fac9db80122047b.1736488799.git-series.apopple@xxxxxxxxxx/
>>> [2] - https://lore.kernel.org/linux-mm/2612ac8a-d0a9-452b-a53d-75ffc6166224@xxxxxxxxxx/
>>>
>>> File-backed DEVICE_PRIVATE/COHERENT pages
>>> =========================================
>>>
>>> Currently DEVICE_PRIVATE and DEVICE_COHERENT pages are only supported for
>>> private anonymous memory. This prevents devices from having local access to
>>> shared or file-backed mappings, instead relying on remote DMA access, which
>>> limits performance.
>>>
>>> I have been prototyping allowing ZONE_DEVICE pages in the page cache with
>>> a callback when the CPU requires access. This approach seems promising and
>>> relatively straightforward but I would like some early feedback on either this
>>> or alternate approaches that I should investigate.
>>>
>>> Combining P2PDMA and DEVICE_PRIVATE pages
>>> =========================================
>>>
>>> Currently device memory that cannot be directly accessed via the CPU can be
>>> represented by DEVICE_PRIVATE pages, allowing it to be mapped and treated like
>>> a normal virtual page by userspace. Many devices also support accessing device
>>> memory directly from the CPU via a PCIe BAR.
>>>
>>> This access requires a P2PDMA page, meaning there are potentially two pages
>>> tracking the same piece of physical memory. This not only seems wasteful but
>>> fraught - for example device drivers need to keep page lifetimes in sync. I
>>> would like to discuss ways of solving this.
>>>
>>> DEVICE_PRIVATE pages, the linear map and the memdesc world
>>> ==========================================================
>>>
>>> DEVICE_PRIVATE pages currently reside in the linear map such that pfn_to_page()
>>> and page_to_pfn() work "as expected". However this implies a contiguous range
>>> of unused physical addresses needs to be both available and allocated for device
>>> memory. This isn't always available, particularly on ARM[1] where the vmemmap
>>> region may not be large enough to accommodate the amount of device memory.
>>>
>>> However it occurs to me that (almost?) all code paths that deal with
>>> DEVICE_PRIVATE pages are already aware of this - in the case of page_to_pfn()
>>> the page can be directly queried with is_device_private_page() and in the case
>>> of pfn_to_page() the pfn has (almost?) always been obtained from a special swap
>>> entry indicating such.
>>>
>>> So does page_to_pfn()/pfn_to_page() really need to work for DEVICE_PRIVATE
>>> pages? If not, could we allocate the struct pages in a vmalloc array instead? Do
>>> we even need ZONE_DEVICE pages/folios in a memdesc world?
>>
>> It occurred to me as well when I was reading your migration proposal above:
>> struct page is not used for DEVICE_PRIVATE, so maybe it is OK to get rid of it.
>> How about DEVICE_COHERENT? Is its struct page used currently? I see the AMD kfd
>> driver is using DEVICE_COHERENT (Christian König cc'd).
>
> I'm not sure removing struct page for DEVICE_COHERENT would be so
> straightforward. Unlike DEVICE_PRIVATE pages these are mapped by normal present
> PTEs, so we can't rely on having a special PTE to figure out which variant of
> pfn_to_{page|memdesc|thing}() to call.
>
> On the other hand this is real memory in the physical address space, and so
> should probably be covered by the linear map anyway and have its own reserved
> region of physical address space. This is unlike DEVICE_PRIVATE entries which
> effectively need to steal some physical address space.

Got it. Like you said above, DEVICE_PRIVATE and DEVICE_COHERENT are both non-LRU
pages, but only DEVICE_COHERENT can be accessed by the CPU. We probably want to
categorize them differently, based on DavidH's email[1]:

DEVICE_PRIVATE: non-folio migration
DEVICE_COHERENT: non-LRU folio migration

[1] https://lore.kernel.org/linux-mm/bb0f813e-7c1b-4257-baa5-5afe18be8552@xxxxxxxxxx/
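
To check that I have the present vs. non-present distinction right, this is
roughly how I picture the two lookup paths. It is purely illustrative (the
helper below is made up, not existing kernel code), but it shows why only
DEVICE_PRIVATE could plausibly drop pfn_to_page(): the special swap entry
identifies the page type before any pfn conversion, whereas a DEVICE_COHERENT
page is only identifiable after the pfn has already been turned into a
struct page.

#include <linux/mm.h>
#include <linux/memremap.h>
#include <linux/swapops.h>

static struct page *zone_device_page_from_pte(struct vm_area_struct *vma,
                                              unsigned long addr, pte_t pte)
{
        struct page *page;

        if (!pte_present(pte)) {
                swp_entry_t entry = pte_to_swp_entry(pte);

                /*
                 * DEVICE_PRIVATE: the non-present entry itself says what
                 * the pfn refers to, so a driver-specific lookup could be
                 * used here instead of the generic pfn_to_page().
                 */
                if (is_device_private_entry(entry))
                        return pfn_swap_entry_to_page(entry);
                return NULL;
        }

        /*
         * DEVICE_COHERENT: an ordinary present PTE, so the type is only
         * known once the pfn has been converted to a struct page.
         */
        page = vm_normal_page(vma, addr, pte);
        if (page && is_device_coherent_page(page))
                return page;
        return NULL;
}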

Best Regards,
Yan, Zi