The patch titled
     Subject: mm: allow compound zone device pages
has been added to the -mm mm-unstable branch.  Its filename is
     mm-allow-compound-zone-device-pages.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-allow-compound-zone-device-pages.patch

This patch will later appear in the mm-unstable branch at
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

------------------------------------------------------
From: Alistair Popple <apopple@xxxxxxxxxx>
Subject: mm: allow compound zone device pages
Date: Wed, 5 Feb 2025 09:48:08 +1100

Zone device pages are used to represent various types of device memory
managed by device drivers.  Currently compound zone device pages are not
supported.  This is because MEMORY_DEVICE_FS_DAX pages are the only user
of higher-order zone device pages, and they have their own page reference
counting.

A future change will unify FS DAX reference counting with normal page
reference counting rules and remove the special FS DAX reference
counting.  Supporting that requires compound zone device pages.

Supporting compound zone device pages requires compound_head() to
distinguish between head and tail pages whilst still preserving the
special struct page fields that are specific to zone device pages.

A tail page is distinguished by having bit zero set in
page->compound_head, with the remaining bits pointing to the head page.
For zone device pages, page->compound_head is shared with page->pgmap.

The page->pgmap field must be common to all pages within a folio, even if
the folio spans memory sections.  Therefore pgmap is the same for both
head and tail pages and can be moved into the folio, and the standard
scheme can be used to find compound_head from a tail page.
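To make that encoding concrete, here is a minimal standalone sketch of
the scheme described above.  The demo_* names are simplified stand-ins
invented for illustration only, not the kernel's real struct page/folio
definitions:

/* Illustration only: simplified stand-ins for struct page/folio. */
#include <stdint.h>
#include <stdio.h>

struct demo_pgmap { int type; };

struct demo_page {
        /* Bit zero set: tail page; remaining bits point to the head. */
        uintptr_t compound_head;
        /* Only meaningful on the head page (the "folio"). */
        struct demo_pgmap *pgmap;
};

/* The standard scheme: strip bit zero to find the head page. */
static struct demo_page *demo_compound_head(struct demo_page *page)
{
        if (page->compound_head & 1)
                return (struct demo_page *)(page->compound_head - 1);
        return page;
}

/* Like page_pgmap() below: always read pgmap via the head page. */
static struct demo_pgmap *demo_page_pgmap(struct demo_page *page)
{
        return demo_compound_head(page)->pgmap;
}

int main(void)
{
        struct demo_pgmap pgmap = { .type = 1 };
        struct demo_page folio[4] = { { .compound_head = 0, .pgmap = &pgmap } };
        int i;

        /* Mark pages 1..3 as tails of page 0, as described above. */
        for (i = 1; i < 4; i++)
                folio[i].compound_head = (uintptr_t)&folio[0] | 1;

        /* Head and tail pages resolve to the same pgmap. */
        printf("%d %d\n", demo_page_pgmap(&folio[0])->type,
               demo_page_pgmap(&folio[3])->type);
        return 0;
}

Because head and tail pages resolve to the same pgmap, the field can
live in the folio, leaving bit zero of the shared first word free to
mark tail pages.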
Wong" <djwong@xxxxxxxxxx> Cc: Dave Chinner <david@xxxxxxxxxxxxx> Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx> Cc: Dave Jiang <dave.jiang@xxxxxxxxx> Cc: Gerald Schaefer <gerald.schaefer@xxxxxxxxxxxxx> Cc: Heiko Carstens <hca@xxxxxxxxxxxxx> Cc: Huacai Chen <chenhuacai@xxxxxxxxxx> Cc: Ira Weiny <ira.weiny@xxxxxxxxx> Cc: Jan Kara <jack@xxxxxxx> Cc: Jason Gunthorpe <jgg@xxxxxxxx> Cc: John Hubbard <jhubbard@xxxxxxxxxx> Cc: linmiaohe <linmiaohe@xxxxxxxxxx> Cc: Logan Gunthorpe <logang@xxxxxxxxxxxx> Cc: Mattew Wilcox <willy@xxxxxxxxxxxxx> Cc: Michael Ellerman <mpe@xxxxxxxxxxxxxx> Cc: Nicholas Piggin <npiggin@xxxxxxxxx> Cc: Peter Xu <peterx@xxxxxxxxxx> Cc: Sven Schnelle <svens@xxxxxxxxxxxxx> Cc: Ted Ts'o <tytso@xxxxxxx> Cc: Vasily Gorbik <gor@xxxxxxxxxxxxx> Cc: Vishal Verma <vishal.l.verma@xxxxxxxxx> Cc: Vivek Goyal <vgoyal@xxxxxxxxxx> Cc: WANG Xuerui <kernel@xxxxxxxxxx> Cc: Will Deacon <will@xxxxxxxxxx> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> --- drivers/gpu/drm/nouveau/nouveau_dmem.c | 3 ++- drivers/pci/p2pdma.c | 6 +++--- include/linux/memremap.h | 6 +++--- include/linux/migrate.h | 4 ++-- include/linux/mm_types.h | 9 +++++++-- include/linux/mmzone.h | 12 +++++++++++- lib/test_hmm.c | 3 ++- mm/hmm.c | 2 +- mm/memory.c | 4 +++- mm/memremap.c | 14 +++++++------- mm/migrate_device.c | 7 +++++-- mm/mlock.c | 2 ++ mm/mm_init.c | 2 +- 13 files changed, 49 insertions(+), 25 deletions(-) --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c~mm-allow-compound-zone-device-pages +++ a/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -88,7 +88,8 @@ struct nouveau_dmem { static struct nouveau_dmem_chunk *nouveau_page_to_chunk(struct page *page) { - return container_of(page->pgmap, struct nouveau_dmem_chunk, pagemap); + return container_of(page_pgmap(page), struct nouveau_dmem_chunk, + pagemap); } static struct nouveau_drm *page_to_drm(struct page *page) --- a/drivers/pci/p2pdma.c~mm-allow-compound-zone-device-pages +++ a/drivers/pci/p2pdma.c @@ -202,7 +202,7 @@ static const struct attribute_group p2pm static void p2pdma_page_free(struct page *page) { - struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap); + struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page)); /* safe to dereference while a reference is held to the percpu ref */ struct pci_p2pdma *p2pdma = rcu_dereference_protected(pgmap->provider->p2pdma, 1); @@ -1025,8 +1025,8 @@ enum pci_p2pdma_map_type pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev, struct scatterlist *sg) { - if (state->pgmap != sg_page(sg)->pgmap) { - state->pgmap = sg_page(sg)->pgmap; + if (state->pgmap != page_pgmap(sg_page(sg))) { + state->pgmap = page_pgmap(sg_page(sg)); state->map = pci_p2pdma_map_type(state->pgmap, dev); state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset; } --- a/include/linux/memremap.h~mm-allow-compound-zone-device-pages +++ a/include/linux/memremap.h @@ -161,7 +161,7 @@ static inline bool is_device_private_pag { return IS_ENABLED(CONFIG_DEVICE_PRIVATE) && is_zone_device_page(page) && - page->pgmap->type == MEMORY_DEVICE_PRIVATE; + page_pgmap(page)->type == MEMORY_DEVICE_PRIVATE; } static inline bool folio_is_device_private(const struct folio *folio) @@ -173,13 +173,13 @@ static inline bool is_pci_p2pdma_page(co { return IS_ENABLED(CONFIG_PCI_P2PDMA) && is_zone_device_page(page) && - page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA; + page_pgmap(page)->type == MEMORY_DEVICE_PCI_P2PDMA; } static inline bool is_device_coherent_page(const struct page *page) { return is_zone_device_page(page) && - 
-                page->pgmap->type == MEMORY_DEVICE_COHERENT;
+                page_pgmap(page)->type == MEMORY_DEVICE_COHERENT;
 }
 
 static inline bool folio_is_device_coherent(const struct folio *folio)
--- a/include/linux/migrate.h~mm-allow-compound-zone-device-pages
+++ a/include/linux/migrate.h
@@ -205,8 +205,8 @@ struct migrate_vma {
         unsigned long                end;
 
         /*
-         * Set to the owner value also stored in page->pgmap->owner for
-         * migrating out of device private memory. The flags also need to
+         * Set to the owner value also stored in page_pgmap(page)->owner
+         * for migrating out of device private memory. The flags also need to
          * be set to MIGRATE_VMA_SELECT_DEVICE_PRIVATE.
          * The caller should always set this field when using mmu notifier
          * callbacks to avoid device MMU invalidations for device private
--- a/include/linux/mm_types.h~mm-allow-compound-zone-device-pages
+++ a/include/linux/mm_types.h
@@ -129,8 +129,11 @@ struct page {
                         unsigned long compound_head;        /* Bit zero is set */
                 };
                 struct {        /* ZONE_DEVICE pages */
-                        /** @pgmap: Points to the hosting device page map. */
-                        struct dev_pagemap *pgmap;
+                        /*
+                         * The first word is used for compound_head or folio
+                         * pgmap
+                         */
+                        void *_unused_pgmap_compound_head;
                         void *zone_device_data;
                         /*
                          * ZONE_DEVICE private pages are counted as being
@@ -299,6 +302,7 @@ typedef struct {
  * @_refcount: Do not access this member directly.  Use folio_ref_count()
  *        to find how many references there are to this folio.
  * @memcg_data: Memory Control Group data.
+ * @pgmap: Metadata for ZONE_DEVICE mappings
  * @virtual: Virtual address in the kernel direct map.
  * @_last_cpupid: IDs of last CPU and last process that accessed the folio.
  * @_entire_mapcount: Do not use directly, call folio_entire_mapcount().
@@ -337,6 +341,7 @@ struct folio {
         /* private: */
                                 };
         /* public: */
+                                struct dev_pagemap *pgmap;
                         };
                         struct address_space *mapping;
                         pgoff_t index;
--- a/include/linux/mmzone.h~mm-allow-compound-zone-device-pages
+++ a/include/linux/mmzone.h
@@ -1174,6 +1174,12 @@ static inline bool is_zone_device_page(c
         return page_zonenum(page) == ZONE_DEVICE;
 }
 
+static inline struct dev_pagemap *page_pgmap(const struct page *page)
+{
+        VM_WARN_ON_ONCE_PAGE(!is_zone_device_page(page), page);
+        return page_folio(page)->pgmap;
+}
+
 /*
  * Consecutive zone device pages should not be merged into the same sgl
  * or bvec segment with other types of pages or if they belong to different
@@ -1189,7 +1195,7 @@ static inline bool zone_device_pages_hav
                 return false;
         if (!is_zone_device_page(a))
                 return true;
-        return a->pgmap == b->pgmap;
+        return page_pgmap(a) == page_pgmap(b);
 }
 
 extern void memmap_init_zone_device(struct zone *, unsigned long,
@@ -1204,6 +1210,10 @@ static inline bool zone_device_pages_hav
 {
         return true;
 }
+static inline struct dev_pagemap *page_pgmap(const struct page *page)
+{
+        return NULL;
+}
 #endif
 
 static inline bool folio_is_zone_device(const struct folio *folio)
--- a/lib/test_hmm.c~mm-allow-compound-zone-device-pages
+++ a/lib/test_hmm.c
@@ -195,7 +195,8 @@ static int dmirror_fops_release(struct i
 
 static struct dmirror_chunk *dmirror_page_to_chunk(struct page *page)
 {
-        return container_of(page->pgmap, struct dmirror_chunk, pagemap);
+        return container_of(page_pgmap(page), struct dmirror_chunk,
+                            pagemap);
 }
 
 static struct dmirror_device *dmirror_page_to_device(struct page *page)
--- a/mm/hmm.c~mm-allow-compound-zone-device-pages
+++ a/mm/hmm.c
@@ -248,7 +248,7 @@ static int hmm_vma_handle_pte(struct mm_
                  * just report the PFN.
                  */
                 if (is_device_private_entry(entry) &&
-                    pfn_swap_entry_to_page(entry)->pgmap->owner ==
+                    page_pgmap(pfn_swap_entry_to_page(entry))->owner ==
                     range->dev_private_owner) {
                         cpu_flags = HMM_PFN_VALID;
                         if (is_writable_device_private_entry(entry))
--- a/mm/memory.c~mm-allow-compound-zone-device-pages
+++ a/mm/memory.c
@@ -4317,6 +4317,7 @@ vm_fault_t do_swap_page(struct vm_fault
                         vmf->page = pfn_swap_entry_to_page(entry);
                         ret = remove_device_exclusive_entry(vmf);
                 } else if (is_device_private_entry(entry)) {
+                        struct dev_pagemap *pgmap;
                         if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
                                 /*
                                  * migrate_to_ram is not yet ready to operate
@@ -4341,7 +4342,8 @@ vm_fault_t do_swap_page(struct vm_fault
                          */
                         get_page(vmf->page);
                         pte_unmap_unlock(vmf->pte, vmf->ptl);
-                        ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
+                        pgmap = page_pgmap(vmf->page);
+                        ret = pgmap->ops->migrate_to_ram(vmf);
                         put_page(vmf->page);
                 } else if (is_hwpoison_entry(entry)) {
                         ret = VM_FAULT_HWPOISON;
--- a/mm/memremap.c~mm-allow-compound-zone-device-pages
+++ a/mm/memremap.c
@@ -458,8 +458,8 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
 
 void free_zone_device_folio(struct folio *folio)
 {
-        if (WARN_ON_ONCE(!folio->page.pgmap->ops ||
-                         !folio->page.pgmap->ops->page_free))
+        if (WARN_ON_ONCE(!folio->pgmap->ops ||
+                         !folio->pgmap->ops->page_free))
                 return;
 
         mem_cgroup_uncharge(folio);
@@ -486,12 +486,12 @@ void free_zone_device_folio(struct folio
          * to clear folio->mapping.
          */
         folio->mapping = NULL;
-        folio->page.pgmap->ops->page_free(folio_page(folio, 0));
+        folio->pgmap->ops->page_free(folio_page(folio, 0));
 
-        switch (folio->page.pgmap->type) {
+        switch (folio->pgmap->type) {
         case MEMORY_DEVICE_PRIVATE:
         case MEMORY_DEVICE_COHERENT:
-                put_dev_pagemap(folio->page.pgmap);
+                put_dev_pagemap(folio->pgmap);
                 break;
 
         case MEMORY_DEVICE_FS_DAX:
@@ -514,7 +514,7 @@ void zone_device_page_init(struct page *
          * Drivers shouldn't be allocating pages after calling
          * memunmap_pages().
          */
-        WARN_ON_ONCE(!percpu_ref_tryget_live(&page->pgmap->ref));
+        WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
         set_page_count(page, 1);
         lock_page(page);
 }
@@ -523,7 +523,7 @@ EXPORT_SYMBOL_GPL(zone_device_page_init)
 #ifdef CONFIG_FS_DAX
 bool __put_devmap_managed_folio_refs(struct folio *folio, int refs)
 {
-        if (folio->page.pgmap->type != MEMORY_DEVICE_FS_DAX)
+        if (folio->pgmap->type != MEMORY_DEVICE_FS_DAX)
                 return false;
 
         /*
--- a/mm/migrate_device.c~mm-allow-compound-zone-device-pages
+++ a/mm/migrate_device.c
@@ -106,6 +106,7 @@ again:
         arch_enter_lazy_mmu_mode();
 
         for (; addr < end; addr += PAGE_SIZE, ptep++) {
+                struct dev_pagemap *pgmap;
                 unsigned long mpfn = 0, pfn;
                 struct folio *folio;
                 struct page *page;
@@ -133,9 +134,10 @@ again:
                         goto next;
 
                 page = pfn_swap_entry_to_page(entry);
+                pgmap = page_pgmap(page);
                 if (!(migrate->flags &
                         MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
-                    page->pgmap->owner != migrate->pgmap_owner)
+                    pgmap->owner != migrate->pgmap_owner)
                         goto next;
 
                 mpfn = migrate_pfn(page_to_pfn(page)) |
@@ -151,12 +153,13 @@ again:
                         goto next;
                 }
                 page = vm_normal_page(migrate->vma, addr, pte);
+                pgmap = page_pgmap(page);
                 if (page && !is_zone_device_page(page) &&
                     !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
                         goto next;
                 else if (page && is_device_coherent_page(page) &&
                          (!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
-                          page->pgmap->owner != migrate->pgmap_owner))
+                          pgmap->owner != migrate->pgmap_owner))
                         goto next;
                 mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
                 mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
--- a/mm/mlock.c~mm-allow-compound-zone-device-pages
+++ a/mm/mlock.c
@@ -368,6 +368,8 @@ static int mlock_pte_range(pmd_t *pmd, u
                 if (is_huge_zero_pmd(*pmd))
                         goto out;
                 folio = pmd_folio(*pmd);
+                if (folio_is_zone_device(folio))
+                        goto out;
                 if (vma->vm_flags & VM_LOCKED)
                         mlock_folio(folio);
                 else
--- a/mm/mm_init.c~mm-allow-compound-zone-device-pages
+++ a/mm/mm_init.c
@@ -998,7 +998,7 @@ static void __ref __init_zone_device_pag
          * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
          * ever freed or placed on a driver-private list.
          */
-        page->pgmap = pgmap;
+        page_folio(page)->pgmap = pgmap;
         page->zone_device_data = NULL;
 
         /*
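As a usage note: the nouveau and test_hmm hunks above both recover a
driver-private structure from the pgmap returned by page_pgmap().  A
standalone sketch of that container_of() pattern follows; the demo_*
types are hypothetical stand-ins, and container_of() is spelled out
with offsetof():

/* Illustration only: the embedded-pagemap lookup pattern. */
#include <stddef.h>
#include <stdio.h>

struct demo_pgmap { int type; };

struct demo_chunk {
        int chunk_id;
        struct demo_pgmap pagemap;      /* embedded, as in the drivers above */
};

/* Equivalent of the kernel's container_of() for this example. */
#define demo_container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

static struct demo_chunk *demo_page_to_chunk(struct demo_pgmap *pgmap)
{
        /* Kernel form: container_of(page_pgmap(page), ..., pagemap) */
        return demo_container_of(pgmap, struct demo_chunk, pagemap);
}

int main(void)
{
        struct demo_chunk chunk = { .chunk_id = 7 };

        printf("chunk %d\n", demo_page_to_chunk(&chunk.pagemap)->chunk_id);
        return 0;
}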
_

Patches currently in -mm which might be from apopple@xxxxxxxxxx are

fuse-fix-dax-truncate-punch_hole-fault-path.patch
fs-dax-return-unmapped-busy-pages-from-dax_layout_busy_page_range.patch
fs-dax-dont-skip-locked-entries-when-scanning-entries.patch
fs-dax-refactor-wait-for-dax-idle-page.patch
fs-dax-create-a-common-implementation-to-break-dax-layouts.patch
fs-dax-always-remove-dax-page-cache-entries-when-breaking-layouts.patch
fs-dax-ensure-all-pages-are-idle-prior-to-filesystem-unmount.patch
fs-dax-remove-page_mapping_dax_shared-mapping-flag.patch
mm-gup-remove-redundant-check-for-pci-p2pdma-page.patch
mm-mm_init-move-p2pdma-page-refcount-initialisation-to-p2pdma.patch
mm-allow-compound-zone-device-pages.patch
mm-memory-enhance-insert_page_into_pte_locked-to-create-writable-mappings.patch
mm-memory-add-vmf_insert_page_mkwrite.patch
rmap-add-support-for-pud-sized-mappings-to-rmap.patch
huge_memory-add-vmf_insert_folio_pud.patch
huge_memory-add-vmf_insert_folio_pmd.patch
mm-gup-dont-allow-foll_longterm-pinning-of-fs-dax-pages.patch
fs-dax-properly-refcount-fs-dax-pages.patch
device-dax-properly-refcount-device-dax-pages-when-mapping.patch