I have a few topics that I would like to discuss around ZONE_DEVICE pages and their current and future usage in the kernel. Generally these pages are used to represent various forms of device memory (PCIe BAR space, coherent accelerator memory, persistent memory, unaddressable device memory). All of these require special treatment by the core MM, so many features must be implemented specifically for ZONE_DEVICE pages. I would like to get feedback on several ideas I've had for a while:

Large page migration for ZONE_DEVICE pages
==========================================

Currently large ZONE_DEVICE pages only exist for persistent memory use cases (DAX, FS DAX). This involves a special reference counting scheme which I hope to have fixed[1] by the time of LSF/MM/BPF. Fixing this allows for other higher order ZONE_DEVICE folios. Specifically I would like to introduce the possibility of migrating large CPU folios to unaddressable (DEVICE_PRIVATE) or coherent (DEVICE_COHERENT) device memory. The current interfaces (migrate_vma) don't allow that as they require all folios to be split. Some of the issues are:

1. What should the interface look like? These are non-lru pages, so there is
   likely overlap with "non-lru page migration in a memdesc world"[2].
2. How do we allow merging/splitting of pages during migration? This is
   necessary because when migrating back from device memory there may not be
   enough large CPU pages available.
3. Any other issues?

[1] - https://lore.kernel.org/linux-mm/cover.11189864684e31260d1408779fac9db80122047b.1736488799.git-series.apopple@xxxxxxxxxx/
[2] - https://lore.kernel.org/linux-mm/2612ac8a-d0a9-452b-a53d-75ffc6166224@xxxxxxxxxx/

File-backed DEVICE_PRIVATE/COHERENT pages
=========================================

Currently DEVICE_PRIVATE and DEVICE_COHERENT pages are only supported for private anonymous memory.
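As background, this anonymous-only path is the migrate_vma interface discussed above, and the per-base-page src/dst arrays are also why large folios must currently be split before migration. A condensed sketch of how a driver uses it today (devmem_alloc_page() is a hypothetical driver helper; destination page locking, error handling and the device-side copy are driver-specific and omitted):

```c
/* Sketch only, not runnable outside a driver: each src[]/dst[] entry
 * covers a single base page, so a large CPU folio has to be split
 * before migrate_vma_setup() will select it. Assumes npages <= 64.
 */
static int sketch_migrate_to_device(struct vm_area_struct *vma,
				    unsigned long start, unsigned long end,
				    void *pgmap_owner)
{
	unsigned long npages = (end - start) >> PAGE_SHIFT;
	unsigned long src[64], dst[64];
	struct migrate_vma args = {
		.vma		= vma,
		.start		= start,
		.end		= end,
		.src		= src,
		.dst		= dst,
		.pgmap_owner	= pgmap_owner,
		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
	};
	unsigned long i;
	int ret;

	ret = migrate_vma_setup(&args);
	if (ret)
		return ret;

	for (i = 0; i < npages; i++) {
		struct page *dpage;

		if (!(args.src[i] & MIGRATE_PFN_MIGRATE))
			continue;

		/* Hypothetical helper returning a free DEVICE_PRIVATE
		 * page; a real driver also locks it and copies the data
		 * to the device here. */
		dpage = devmem_alloc_page();
		args.dst[i] = migrate_pfn(page_to_pfn(dpage));
	}

	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);
	return 0;
}
```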
This prevents devices from having local access to shared or file-backed mappings, instead relying on remote DMA access, which limits performance. I have been prototyping allowing ZONE_DEVICE pages in the page cache with a callback for when the CPU requires access. This approach seems promising and relatively straightforward, but I would like some early feedback on it or on alternate approaches that I should investigate.

Combining P2PDMA and DEVICE_PRIVATE pages
=========================================

Currently device memory that cannot be directly accessed by the CPU can be represented by DEVICE_PRIVATE pages, allowing it to be mapped and treated like a normal virtual page by userspace. Many devices also support accessing device memory directly from the CPU via a PCIe BAR. This access requires a P2PDMA page, meaning there are potentially two pages tracking the same piece of physical memory. This not only seems wasteful but fraught - for example, device drivers need to keep the two page lifetimes in sync. I would like to discuss ways of solving this.

DEVICE_PRIVATE pages, the linear map and the memdesc world
==========================================================

DEVICE_PRIVATE pages currently reside in the linear map such that pfn_to_page() and page_to_pfn() work "as expected". However this implies a contiguous range of unused physical addresses needs to be both available and allocated for device memory. This isn't always available, particularly on ARM[1] where the vmemmap region may not be large enough to accommodate the amount of device memory. However it occurs to me that (almost?) all code paths that deal with DEVICE_PRIVATE pages are already aware of this - in the case of page_to_pfn() the page can be directly queried with is_device_private_page(), and in the case of pfn_to_page() the pfn has (almost?) always been obtained from a special swap entry indicating such. So does page_to_pfn()/pfn_to_page() really need to work for DEVICE_PRIVATE pages?
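To illustrate the swap-entry path, a sketch (kernel-internal, not runnable standalone) of how code typically reaches a DEVICE_PRIVATE struct page: the pfn arrives encoded in a non-present PTE, so the page can be recovered from the swap entry without a linear-map-backed pfn_to_page():

```c
/* Sketch: DEVICE_PRIVATE pages are found via special swap PTEs rather
 * than bare pfns, so the pfn -> page step goes through the swap entry.
 */
static struct page *sketch_devpriv_page_from_pte(pte_t pte)
{
	swp_entry_t entry;

	if (pte_present(pte) || pte_none(pte))
		return NULL;

	entry = pte_to_swp_entry(pte);
	if (!is_device_private_entry(entry))
		return NULL;

	/* The only pfn -> page conversion on this path. */
	return pfn_swap_entry_to_page(entry);
}
```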
If not, could we allocate the struct pages in a vmalloc array instead? Do we even need ZONE_DEVICE pages/folios in a memdesc world?

[1] - https://lore.kernel.org/linux-arm-kernel/CAMj1kXHxyntweiq76CdW=ov2_CkEQUbdPekGNDtFp7rBCJJE2w@xxxxxxxxxxxxxx/

Other issues/ideas
==================

Are there any other clean-ups or features that people are interested in seeing?

 - Alistair