On Wed, Dec 02, 2020 at 11:05:53AM +0100, David Hildenbrand wrote:
> If you take a look at generic_online_page() there are some things that
> won't be done for our vmemmap pages
>
> 1. kernel_map_pages(page, 1 << order, 1);
>
> We're accessing these pages already when initializing the memmap. We
> might have to explicitly map these vmemmap pages at some point. Might
> require some thought. Did you test with debug pagealloc?

I always try to run with all debug stuff enabled, but I definitely did
not enable debug_pagealloc. I will have a look at it.

> 2. totalram_pages_add(1UL << order);
>
> We should add/remove the vmemmap pages manually from totalram I guess.

Yes, we should. That was a clear oversight.

> Also, I wonder if we should add these pages to the zone managed page
> count. Not quite sure yet what the benefit would be and if we care at
> all. (it's a change in behavior when hotplugging memory, though, that's
> why I am pointing it out)

I would rather not, since they are not "manageable" per se, the same way
we do not account the pages used for bootmem stuff into zone->managed.

> > Hot-remove:
> >
> > If the range was using memmap on memory (aka vmemmap pages),
> > we construct an altmap structure so free_hugepage_table does
> > the right thing and calls vmem_altmap_free instead of
> > free_pagetable.
> >
>
> You might want to update the comment in include/linux/mmzone.h for
> ZONE_MOVABLE, adding this special case when onlining such memory to the
> MOVABLE zone. The pages are movable for offlining only, not for
> allocations.

Sure, let us document that.

> > -int create_memory_block_devices(unsigned long start, unsigned long size)
> > +int create_memory_block_devices(unsigned long start, unsigned long size,
> > +				unsigned long vmemmap_pages)
> >  {
> >  	const unsigned long start_block_id = pfn_to_block_id(PFN_DOWN(start));
> >  	unsigned long end_block_id = pfn_to_block_id(PFN_DOWN(start + size));
> > @@ -645,9 +649,10 @@ int create_memory_block_devices(unsigned long start, unsigned long size)
> >  		return -EINVAL;
> >
> >  	for (block_id = start_block_id; block_id != end_block_id; block_id++) {
> > -		ret = init_memory_block(block_id, MEM_OFFLINE);
> > +		ret = init_memory_block(block_id, MEM_OFFLINE, vmemmap_pages);
> >  		if (ret)
> >  			break;
> > +		vmemmap_pages = 0;
>
> We should even bail out if someone would have vmemmap_pages set when
> creating more than one memory block device.

That cannot really happen, as we would have bailed out earlier, prior to
calling create_memory_block_devices, due to the checks in
mhp_supports_memmap_on_memory.

> > -	for (pfn = start_pfn; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
> > -		(*online_page_callback)(pfn_to_page(pfn), MAX_ORDER - 1);
> > +	for (pfn = buddy_start_pfn; pfn < end_pfn; pfn += (1 << order)) {
> > +		order = MAX_ORDER - 1;
> > +		while (pfn & ((1 << order) - 1))
> > +			order--;
> > +		(*online_page_callback)(pfn_to_page(pfn), order);
> > +	}
>
> I think it would be nicer to handle the unaligned part in a separate
> loop.

No strong opinion here, I can certainly do it.

> >  	/* mark all involved sections as online */
> >  	online_mem_sections(start_pfn, end_pfn);
> > @@ -678,6 +684,7 @@ static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned lon
> >  	pgdat->node_spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - pgdat->node_start_pfn;
> >
> >  }
> > +
>
> Unrelated change.

Oops ^^

> > +static int get_memblock_vmemmap_pages_cb(struct memory_block *mem, void *arg)
> > +{
> > +	unsigned long *nr_vmemmap_pages = (unsigned long *)arg;
> > +	int ret = !mem->nr_vmemmap_pages;
> > +
> > +	if (!ret)
> > +		*nr_vmemmap_pages += mem->nr_vmemmap_pages;
> > +	return ret;
>
> When removing in the same granularity as adding, all we care about is
> the first memory block.
>
> 	unsigned long *nr_vmemmap_pages = (unsigned long *)arg;
>
> 	*nr_vmemmap_pages = mem->nr_vmemmap_pages;
> 	/* We only care about the first memory block. */
> 	return 1;

Yeah, that is cleaner, thanks.

> > +	/*
> > +	 * Prepare a vmem_altmap struct if we used it at hot-add, so
> > +	 * remove_pmd_table->free_hugepage_table does the right thing.
> > +	 */
> > +	(void)walk_memory_blocks(start, size, &nr_vmemmap_pages,
> > +				 get_memblock_vmemmap_pages_cb);
> > +	if (nr_vmemmap_pages) {
> > +		mhp_altmap.alloc = nr_vmemmap_pages;
> > +		altmap = &mhp_altmap;
> > +	}
> > +
>
> If someone would remove_memory() in a different granularity than
> add_memory(), this would no longer work. How can we catch that
> efficiently? Or at least document that this is not supported with
> memmap_on_memory.

Well, we can check whether the size spans more than a single memory
block, and if it does, print a warning with pr_warn and refuse to remove
the memory.
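
Something along these lines is what I have in mind (completely untested
sketch; the helper name and keying the check off memory_block_size_bytes()
are just my assumption of how it could end up looking):

static int check_memmap_on_memory_granularity(u64 start, u64 size,
					      unsigned long nr_vmemmap_pages)
{
	/*
	 * Ranges added with memmap_on_memory span a single memory block,
	 * so we cannot reconstruct the altmap when the caller removes in
	 * a different granularity than it added.
	 */
	if (nr_vmemmap_pages && size != memory_block_size_bytes()) {
		pr_warn("memmap_on_memory: refusing to remove range [%#llx-%#llx], it must span a single memory block\n",
			start, start + size - 1);
		return -EINVAL;
	}
	return 0;
}

try_remove_memory() would call it right after walking the memory blocks
and bail out before tearing anything down.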
We could document that "memory_hotplug.memmap_on_memory" is meant to
operate with ranges that span a single memory block, and I guess we
could place that documentation under [1].

[1] Documentation/admin-guide/kernel-parameters.txt

Thanks for the review, David.

--
Oscar Salvador
SUSE L3