The patch titled
     Subject: mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() instead of PageReserved()
has been added to the -mm mm-unstable branch.  Its filename is
     mm-memory_hotplug-initialize-memmap-of-zone_device-with-pageoffline-instead-of-pagereserved.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-memory_hotplug-initialize-memmap-of-zone_device-with-pageoffline-instead-of-pagereserved.patch

This patch will later appear in the mm-unstable branch at
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

------------------------------------------------------
From: David Hildenbrand <david@xxxxxxxxxx>
Subject: mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() instead of PageReserved()
Date: Fri, 7 Jun 2024 11:09:37 +0200

We currently initialize the memmap such that PG_reserved is set and the
refcount of the page is 1.  In virtio-mem code, we have to manually clear
that PG_reserved flag to make memory offlining with partially hotplugged
memory blocks possible: has_unmovable_pages() would otherwise bail out on
such pages.

We want to avoid PG_reserved where possible and move to typed pages
instead.  We also want to further enlighten memory offlining code about
PG_offline: offline pages in an online memory section.  One example is
handling managed page count adjustments in a cleaner way during memory
offlining.

So let's initialize the pages with PG_offline instead of PG_reserved.
generic_online_page()->__free_pages_core() will now clear that flag before
handing that memory to the buddy.

Note that the page refcount is still 1 and would forbid offlining of such
memory except when special care is taken during GOING_OFFLINE as currently
only implemented by virtio-mem.

With this change, we can now get non-PageReserved() pages in the XEN
balloon list.  From what I can tell, that can already happen via
decrease_reservation(), so that should be fine.

HV-balloon should not really observe a change: partial online memory
blocks still cannot get surprise-offlined, because the refcount of these
PageOffline() pages is 1.

Update virtio-mem, HV-balloon and XEN-balloon code to be aware that
hotplugged pages are now PageOffline() instead of PageReserved() before
they are handed over to the buddy.

We'll leave the ZONE_DEVICE case alone for now.

Link: https://lkml.kernel.org/r/20240607090939.89524-3-david@xxxxxxxxxx
Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
Cc: Alexander Potapenko <glider@xxxxxxxxxx>
Cc: Dexuan Cui <decui@xxxxxxxxxxxxx>
Cc: Dmitry Vyukov <dvyukov@xxxxxxxxxx>
Cc: Eugenio Pérez <eperezma@xxxxxxxxxx>
Cc: Haiyang Zhang <haiyangz@xxxxxxxxxxxxx>
Cc: Jason Wang <jasowang@xxxxxxxxxx>
Cc: Juergen Gross <jgross@xxxxxxxx>
Cc: "K. Y. Srinivasan" <kys@xxxxxxxxxxxxx>
Cc: Marco Elver <elver@xxxxxxxxxx>
Cc: Michael S. Tsirkin <mst@xxxxxxxxxx>
Cc: Mike Rapoport (IBM) <rppt@xxxxxxxxxx>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@xxxxxxxx>
Cc: Oscar Salvador <osalvador@xxxxxxx>
Cc: Stefano Stabellini <sstabellini@xxxxxxxxxx>
Cc: Wei Liu <wei.liu@xxxxxxxxxx>
Cc: Xuan Zhuo <xuanzhuo@xxxxxxxxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 drivers/hv/hv_balloon.c     |    5 ++---
 drivers/virtio/virtio_mem.c |   18 ++++++++++++------
 drivers/xen/balloon.c       |    9 +++++++--
 include/linux/page-flags.h  |   12 +++++-------
 mm/memory_hotplug.c         |   16 ++++++++++------
 mm/mm_init.c                |   10 ++++++++--
 mm/page_alloc.c             |   32 +++++++++++++++++++++++---------
 7 files changed, 67 insertions(+), 35 deletions(-)

--- a/drivers/hv/hv_balloon.c~mm-memory_hotplug-initialize-memmap-of-zone_device-with-pageoffline-instead-of-pagereserved
+++ a/drivers/hv/hv_balloon.c
@@ -693,9 +693,8 @@ static void hv_page_online_one(struct hv
 		if (!PageOffline(pg))
 			__SetPageOffline(pg);
 		return;
-	}
-	if (PageOffline(pg))
-		__ClearPageOffline(pg);
+	} else if (!PageOffline(pg))
+		return;
 
 	/* This frame is currently backed; online the page. */
 	generic_online_page(pg, 0);
--- a/drivers/virtio/virtio_mem.c~mm-memory_hotplug-initialize-memmap-of-zone_device-with-pageoffline-instead-of-pagereserved
+++ a/drivers/virtio/virtio_mem.c
@@ -1146,12 +1146,16 @@ static void virtio_mem_set_fake_offline(
 	for (; nr_pages--; pfn++) {
 		struct page *page = pfn_to_page(pfn);
 
-		__SetPageOffline(page);
-		if (!onlined) {
+		if (!onlined)
+			/*
+			 * Pages that have not been onlined yet were initialized
+			 * to PageOffline(). Remember that we have to route them
+			 * through generic_online_page().
+			 */
 			SetPageDirty(page);
-			/* FIXME: remove after cleanups */
-			ClearPageReserved(page);
-		}
+		else
+			__SetPageOffline(page);
+		VM_WARN_ON_ONCE(!PageOffline(page));
 	}
 	page_offline_end();
 }
@@ -1166,9 +1170,11 @@ static void virtio_mem_clear_fake_offlin
 	for (; nr_pages--; pfn++) {
 		struct page *page = pfn_to_page(pfn);
 
-		__ClearPageOffline(page);
 		if (!onlined)
+			/* generic_online_page() will clear PageOffline(). */
 			ClearPageDirty(page);
+		else
+			__ClearPageOffline(page);
 	}
 }
 
--- a/drivers/xen/balloon.c~mm-memory_hotplug-initialize-memmap-of-zone_device-with-pageoffline-instead-of-pagereserved
+++ a/drivers/xen/balloon.c
@@ -146,7 +146,8 @@ static DECLARE_WAIT_QUEUE_HEAD(balloon_w
 /* balloon_append: add the given page to the balloon. */
 static void balloon_append(struct page *page)
 {
-	__SetPageOffline(page);
+	if (!PageOffline(page))
+		__SetPageOffline(page);
 
 	/* Lowmem is re-populated first, so highmem pages go at list tail. */
 	if (PageHighMem(page)) {
@@ -412,7 +413,11 @@ static enum bp_state increase_reservatio
 
 		xenmem_reservation_va_mapping_update(1, &page, &frame_list[i]);
 
-		/* Relinquish the page back to the allocator. */
+		/*
+		 * Relinquish the page back to the allocator. Note that
+		 * some pages, including ones added via xen_online_page(), might
+		 * not be marked reserved; free_reserved_page() will handle that.
+		 */
 		free_reserved_page(page);
 	}
 
--- a/include/linux/page-flags.h~mm-memory_hotplug-initialize-memmap-of-zone_device-with-pageoffline-instead-of-pagereserved
+++ a/include/linux/page-flags.h
@@ -30,16 +30,11 @@
  * - Pages falling into physical memory gaps - not IORESOURCE_SYSRAM. Trying
  *   to read/write these pages might end badly. Don't touch!
  * - The zero page(s)
- * - Pages not added to the page allocator when onlining a section because
- *   they were excluded via the online_page_callback() or because they are
- *   PG_hwpoison.
  * - Pages allocated in the context of kexec/kdump (loaded kernel image,
  *   control pages, vmcoreinfo)
  * - MMIO/DMA pages. Some architectures don't allow to ioremap pages that are
  *   not marked PG_reserved (as they might be in use by somebody else who does
  *   not respect the caching strategy).
- * - Pages part of an offline section (struct pages of offline sections should
- *   not be trusted as they will be initialized when first onlined).
  * - MCA pages on ia64
  * - Pages holding CPU notes for POWER Firmware Assisted Dump
  * - Device memory (e.g. PMEM, DAX, HMM)
@@ -1021,6 +1016,10 @@ PAGE_TYPE_OPS(Buddy, buddy, buddy)
  * The content of these pages is effectively stale. Such pages should not
  * be touched (read/write/dump/save) except by their owner.
  *
+ * When a memory block gets onlined, all pages are initialized with a
+ * refcount of 1 and PageOffline(). generic_online_page() will
+ * take care of clearing PageOffline().
+ *
  * If a driver wants to allow to offline unmovable PageOffline() pages without
  * putting them back to the buddy, it can do so via the memory notifier by
  * decrementing the reference count in MEM_GOING_OFFLINE and incrementing the
@@ -1028,8 +1027,7 @@ PAGE_TYPE_OPS(Buddy, buddy, buddy)
  * pages (now with a reference count of zero) are treated like free pages,
  * allowing the containing memory block to get offlined. A driver that
  * relies on this feature is aware that re-onlining the memory block will
- * require to re-set the pages PageOffline() and not giving them to the
- * buddy via online_page_callback_t.
+ * require not giving them to the buddy via generic_online_page().
  *
  * There are drivers that mark a page PageOffline() and expect there won't be
  * any further access to page content. PFN walkers that read content of random
--- a/mm/memory_hotplug.c~mm-memory_hotplug-initialize-memmap-of-zone_device-with-pageoffline-instead-of-pagereserved
+++ a/mm/memory_hotplug.c
@@ -734,7 +734,7 @@ static inline void section_taint_zone_de
 /*
  * Associate the pfn range with the given zone, initializing the memmaps
  * and resizing the pgdat/zone data to span the added pages. After this
- * call, all affected pages are PG_reserved.
+ * call, all affected pages are PageOffline().
  *
  * All aligned pageblocks are initialized to the specified migratetype
  * (usually MIGRATE_MOVABLE). Besides setting the migratetype, no related
@@ -1100,8 +1100,12 @@ int mhp_init_memmap_on_memory(unsigned l
 
 	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
 
-	for (i = 0; i < nr_pages; i++)
-		SetPageVmemmapSelfHosted(pfn_to_page(pfn + i));
+	for (i = 0; i < nr_pages; i++) {
+		struct page *page = pfn_to_page(pfn + i);
+
+		__ClearPageOffline(page);
+		SetPageVmemmapSelfHosted(page);
+	}
 
 	/*
 	 * It might be that the vmemmap_pages fully span sections. If that is
@@ -1959,9 +1963,9 @@ int __ref offline_pages(unsigned long st
 	 * Don't allow to offline memory blocks that contain holes.
 	 * Consequently, memory blocks with holes can never get onlined
 	 * via the hotplug path - online_pages() - as hotplugged memory has
-	 * no holes. This way, we e.g., don't have to worry about marking
-	 * memory holes PG_reserved, don't need pfn_valid() checks, and can
-	 * avoid using walk_system_ram_range() later.
+	 * no holes. This way, we don't have to worry about memory holes,
+	 * don't need pfn_valid() checks, and can avoid using
+	 * walk_system_ram_range() later.
 	 */
 	walk_system_ram_range(start_pfn, nr_pages, &system_ram_pages,
 			      count_system_ram_pages_cb);
--- a/mm/mm_init.c~mm-memory_hotplug-initialize-memmap-of-zone_device-with-pageoffline-instead-of-pagereserved
+++ a/mm/mm_init.c
@@ -892,8 +892,14 @@ void __meminit memmap_init_range(unsigne
 
 		page = pfn_to_page(pfn);
 		__init_single_page(page, pfn, zone, nid);
-		if (context == MEMINIT_HOTPLUG)
-			__SetPageReserved(page);
+		if (context == MEMINIT_HOTPLUG) {
+#ifdef CONFIG_ZONE_DEVICE
+			if (zone == ZONE_DEVICE)
+				__SetPageReserved(page);
+			else
+#endif
+				__SetPageOffline(page);
+		}
 
 		/*
 		 * Usually, we want to mark the pageblock MIGRATE_MOVABLE,
--- a/mm/page_alloc.c~mm-memory_hotplug-initialize-memmap-of-zone_device-with-pageoffline-instead-of-pagereserved
+++ a/mm/page_alloc.c
@@ -1225,18 +1225,23 @@ void __free_pages_core(struct page *page
 	 * When initializing the memmap, __init_single_page() sets the refcount
 	 * of all pages to 1 ("allocated"/"not free"). We have to set the
 	 * refcount of all involved pages to 0.
+	 *
+	 * Note that hotplugged memory pages are initialized to PageOffline().
+	 * Pages freed from memblock might be marked as reserved.
 	 */
-	prefetchw(p);
-	for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
-		prefetchw(p + 1);
-		__ClearPageReserved(p);
-		set_page_count(p, 0);
-	}
-	__ClearPageReserved(p);
-	set_page_count(p, 0);
-
 	if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG) &&
 	    unlikely(context == MEMINIT_HOTPLUG)) {
+		prefetchw(p);
+		for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
+			prefetchw(p + 1);
+			VM_WARN_ON_ONCE(PageReserved(p));
+			__ClearPageOffline(p);
+			set_page_count(p, 0);
+		}
+		VM_WARN_ON_ONCE(PageReserved(p));
+		__ClearPageOffline(p);
+		set_page_count(p, 0);
+
 		/*
 		 * Freeing the page with debug_pagealloc enabled will try to
 		 * unmap it; some archs don't like double-unmappings, so
@@ -1245,6 +1250,15 @@ void __free_pages_core(struct page *page
 		debug_pagealloc_map_pages(page, nr_pages);
 		adjust_managed_page_count(page, nr_pages);
 	} else {
+		prefetchw(p);
+		for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
+			prefetchw(p + 1);
+			__ClearPageReserved(p);
+			set_page_count(p, 0);
+		}
+		__ClearPageReserved(p);
+		set_page_count(p, 0);
+
 		/* memblock adjusts totalram_pages() ahead of time. */
 		atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
 	}
_

Patches currently in -mm which might be from david@xxxxxxxxxx are

revert-mm-init_mlocked_on_free_v3.patch
mm-memory-move-page_count-check-into-validate_page_before_insert.patch
mm-memory-cleanly-support-zeropage-in-vm_insert_page-vm_map_pages-and-vmf_insert_mixed.patch
mm-rmap-sanity-check-that-zeropages-are-not-passed-to-rmap.patch
mm-update-_mapcount-and-page_type-documentation.patch
mm-allow-reuse-of-the-lower-16-bit-of-the-page-type-with-an-actual-type.patch
mm-zsmalloc-use-a-proper-page-type.patch
mm-page_alloc-clear-pagebuddy-using-__clearpagebuddy-for-bad-pages.patch
mm-filemap-reinitialize-folio-_mapcount-directly.patch
mm-mm_init-initialize-page-_mapcount-directly-in-__init_single_page.patch
fs-proc-task_mmu-indicate-pm_file-for-pmd-mapped-file-thp.patch
fs-proc-task_mmu-dont-indicate-pm_mmap_exclusive-without-pm_present.patch
fs-proc-task_mmu-properly-detect-pm_mmap_exclusive-per-page-of-pmd-mapped-thps.patch
fs-proc-task_mmu-account-non-present-entries-as-maybe-shared-but-no-idea-how-often.patch
fs-proc-move-page_mapcount-to-fs-proc-internalh.patch
documentation-admin-guide-mm-pagemaprst-drop-using-pagemap-to-do-something-useful.patch
mm-pass-meminit_context-to-__free_pages_core.patch
mm-pass-meminit_context-to-__free_pages_core-fix.patch
mm-memory_hotplug-initialize-memmap-of-zone_device-with-pageoffline-instead-of-pagereserved.patch
mm-memory_hotplug-skip-adjust_managed_page_count-for-pageoffline-pages-when-offlining.patch