David Hildenbrand <david@xxxxxxxxxx> writes:

> On 24.07.23 18:02, Aneesh Kumar K V wrote:
>> On 7/24/23 9:11 PM, David Hildenbrand wrote:
>>> On 24.07.23 17:16, Aneesh Kumar K V wrote:
>>>
>>>>>
>>>>> /*
>>>>>  * In "forced" memmap_on_memory mode, we always align the vmemmap size up to cover
>>>>>  * full pageblocks. That way, we can add memory even if the vmemmap size is not properly
>>>>>  * aligned, however, we might waste memory.
>>>>>  */
>>>>
>>>> I am finding that confusing. We do want things to be pageblock_nr_pages aligned both ways.
>>>> With MEMMAP_ON_MEMORY_FORCE, we do that by allocating more space for the memmap, and
>>>> in the default case we do that by making sure only memory blocks of specific sizes supporting
>>>> that alignment can use the MEMMAP_ON_MEMORY feature.
>>>
>>> See the usage in mhp_supports_memmap_on_memory(), I guess that makes sense then.
>>>
>>> But if you have any ideas on how to clarify that (terminology), I'm all ears!
>>>
>>
>>
>> I updated the commit message:
>>
>> mm/hotplug: Support memmap_on_memory when memmap is not aligned to pageblocks
>>
>> Currently, the memmap_on_memory feature is only supported with memory block
>> sizes that result in vmemmap pages covering full page blocks. This is
>> because memory onlining/offlining code requires applicable ranges to be
>> pageblock-aligned, for example, to set the migratetypes properly.
>>
>> This patch helps to lift that restriction by reserving more pages than
>> required for vmemmap space. This helps to keep the start address
>> pageblock-aligned with different memory block sizes. This implies the
>> kernel will be reserving some pages for every memory block. This also
>> allows the memmap_on_memory feature to be widely useful with different
>> memory block size values.
>>
>> For example, with a 64K page size and 256MiB memory block size, we require 4
>> pages to map vmemmap pages. To align things correctly we end up adding a
>> reserve of 28 pages, i.e., for every 4096 pages, 28 pages get reserved.
>>
>>
>
> Much better.
>
>> Also, while implementing your suggestion to use memory_block_memmap_on_memory_size(),
>> I am finding it not really useful, because in mhp_supports_memmap_on_memory() we are checking
>> whether remaining_size is pageblock_nr_pages aligned (dax_kmem may want to use that helper
>> later).
>
> Let's focus on this patchset here first.
>
> Factoring out how many memmap pages we actually need vs. how many pages
> we need when aligning up sounds very reasonable to me.
>
>
> Can you elaborate what the problem is?
>
>> Also, I still think altmap.reserve is easier because of the start_pfn calculation.
>> (more on this below)
>
> Can you elaborate? Do you mean the try_remove_memory() change?
>
>>
>>
>>> [...]
>>>
>>>>>> +	return arch_supports_memmap_on_memory(size);
>>>>>>  }
>>>>>>
>>>>>>  /*
>>>>>> @@ -1311,7 +1391,11 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>>>>>>  {
>>>>>>  	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
>>>>>>  	enum memblock_flags memblock_flags = MEMBLOCK_NONE;
>>>>>> -	struct vmem_altmap mhp_altmap = {};
>>>>>> +	struct vmem_altmap mhp_altmap = {
>>>>>> +		.base_pfn = PHYS_PFN(res->start),
>>>>>> +		.end_pfn = PHYS_PFN(res->end),
>>>>>> +		.reserve = memory_block_align_base(resource_size(res)),
>>>>>
>>>>> Can you remind me why we have to set reserve here at all?
>>>>>
>>>>> IOW, can't we simply set
>>>>>
>>>>> .free = memory_block_memmap_on_memory_size();
>>>>>
>>>>> and then pass
>>>>>
>>>>> mhp_altmap.alloc + mhp_altmap.free
>>>>>
>>>>> to create_memory_block_devices() instead?
>>>>>
>>>>
>>>> But with the dax usage of altmap, altmap->reserve is what we use to reserve things to get
>>>> the required alignment. One difference is where we allocate the struct pages at. For this specific
>>>> case it should not matter.
>>>>
>>>> static unsigned long __meminit vmem_altmap_next_pfn(struct vmem_altmap *altmap)
>>>> {
>>>> 	return altmap->base_pfn + altmap->reserve + altmap->alloc
>>>> 		+ altmap->align;
>>>> }
>>>>
>>>> And the other is where we online a memory block.
>>>>
>>>> We find the start pfn using mem->altmap->alloc + mem->altmap->reserve;
>>>>
>>>> Considering altmap->reserve is what dax pfn_dev uses, is there a reason you want to use altmap->free for this?
>>>
>>> "Reserve" is all about "reserving that much memory for driver usage".
>>>
>>> We don't care about that. We simply want vmemmap allocations coming from the pageblock(s) we set aside. Where exactly, we don't care.
>>>
>>>> I find it confusing to update free when we haven't allocated any altmap blocks yet.
>>>
>>> "
>>> @reserve: pages mapped, but reserved for driver use (relative to @base)
>>> @free: free pages set aside in the mapping for memmap storage
>>> @alloc: track pages consumed, private to vmemmap_populate()
>>> "
>>>
>>> To me, that implies that we can ignore "reserve". We set @free to the aligned value and let the vmemmap get allocated from anything in there.
>>>
>>> free + alloc should always sum up to our set-aside pageblock(s), no?
>>>
>>>
>>
>> The difference is
>>
>> mhp_altmap.free = PHYS_PFN(size) - reserved blocks;
>>
>> i.e., with a 256MiB memory block size and 64K pages, we need 4 memmap pages and we reserve 28 pages for alignment, so
>>
>> mhp_altmap.free = PHYS_PFN(size) - 28.
>>
>> The 4 pages from which we are allocating the memmap pages are thus still counted as free pages.
>>
>> We could also make it work by doing
>>
>> mhp_altmap.free = PHYS_PFN(size) - (memory_block_memmap_on_memory_size() - memory_block_memmap_size())
>>
>> But is that any better than what we have now? I understand the term "reserved for driver use" is confusing for this use case,
>> but it is really reserving things for the required alignment.
>
>
> Let's take a step back.
>
> altmap->alloc tells us how much was already allocated.
>
> altmap->free tells us how much memory we can allocate at max (confusing,
> but see vmem_altmap_nr_free()).
>
> altmap->free should actually have been called differently.
>
>
> I think it's currently even *wrong* to set free = PHYS_PFN(size). We
> don't want to allocate beyond the first pageblock(s) we selected.
>

You are correct. The calculation of altmap.free was wrong: it was wrong upstream, and the upstream ppc64 code also computed it incorrectly.
modified   arch/powerpc/mm/init_64.c
@@ -326,8 +326,7 @@ void __ref __vmemmap_free(unsigned long start, unsigned long end,
 	start = ALIGN_DOWN(start, page_size);
 	if (altmap) {
 		alt_start = altmap->base_pfn;
-		alt_end = altmap->base_pfn + altmap->reserve +
-			  altmap->free + altmap->alloc + altmap->align;
+		alt_end = altmap->base_pfn + altmap->reserve + altmap->free;
 	}
 
 	pr_debug("vmemmap_free %lx...%lx\n", start, end);

Fixing all that up, the patch is now updated as below:

 1 file changed, 109 insertions(+), 15 deletions(-)
 mm/memory_hotplug.c | 124 +++++++++++++++++++++++++++++++++++++++++++++-------

modified   mm/memory_hotplug.c
@@ -41,17 +41,91 @@
 #include "internal.h"
 #include "shuffle.h"
 
+enum {
+	MEMMAP_ON_MEMORY_DISABLE = 0,
+	MEMMAP_ON_MEMORY_ENABLE,
+	MEMMAP_ON_MEMORY_FORCE,
+};
+
+static int memmap_mode __read_mostly = MEMMAP_ON_MEMORY_DISABLE;
+
+static inline unsigned long memory_block_memmap_size(void)
+{
+	return PHYS_PFN(memory_block_size_bytes()) * sizeof(struct page);
+}
+
+static inline unsigned long memory_block_memmap_on_memory_size(void)
+{
+	unsigned long size = memory_block_memmap_size();
+
+	/*
+	 * In "forced" memmap_on_memory mode, we add extra pages to align the
+	 * vmemmap size up to cover full pageblocks. That way, we can add memory
+	 * even if the vmemmap size is not properly aligned, however, we might waste
+	 * memory.
+	 */
+	if (memmap_mode == MEMMAP_ON_MEMORY_FORCE)
+		return ALIGN(size, PFN_PHYS(pageblock_nr_pages));
+	return size;
+}
+
 #ifdef CONFIG_MHP_MEMMAP_ON_MEMORY
 /*
  * memory_hotplug.memmap_on_memory parameter
  */
-static bool memmap_on_memory __ro_after_init;
-module_param(memmap_on_memory, bool, 0444);
-MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug");
+static int set_memmap_mode(const char *val, const struct kernel_param *kp)
+{
+	int ret, mode;
+	bool enabled;
+
+	if (sysfs_streq(val, "force") || sysfs_streq(val, "FORCE")) {
+		mode = MEMMAP_ON_MEMORY_FORCE;
+		goto matched;
+	}
+
+	ret = kstrtobool(val, &enabled);
+	if (ret < 0)
+		return ret;
+	if (enabled)
+		mode = MEMMAP_ON_MEMORY_ENABLE;
+	else
+		mode = MEMMAP_ON_MEMORY_DISABLE;
+
+matched:
+	*((int *)kp->arg) = mode;
+	if (mode == MEMMAP_ON_MEMORY_FORCE) {
+		pr_info("Memory hotplug will reserve %ld pages in each memory block\n",
+			memory_block_memmap_on_memory_size() - memory_block_memmap_size());
+	}
+	return 0;
+}
+
+static int get_memmap_mode(char *buffer, const struct kernel_param *kp)
+{
+	if (*((int *)kp->arg) == MEMMAP_ON_MEMORY_FORCE)
+		return sprintf(buffer, "force\n");
+	if (*((int *)kp->arg) == MEMMAP_ON_MEMORY_ENABLE)
+		return sprintf(buffer, "y\n");
+
+	return sprintf(buffer, "n\n");
+}
+
+static const struct kernel_param_ops memmap_mode_ops = {
+	.set = set_memmap_mode,
+	.get = get_memmap_mode,
+};
+module_param_cb(memmap_on_memory, &memmap_mode_ops, &memmap_mode, 0444);
+MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug\n"
+	"With value \"force\" it could result in memory wastage due to memmap size limitations\n"
+	"For example, if the memmap for a memory block requires 1 MiB, but the pageblock\n"
+	"size is 2 MiB, 1 MiB of hotplugged memory will be wasted. Note that there are\n"
+	"still cases where the feature cannot be enforced: for example, if the memmap is\n"
+	"smaller than a single page, or if the architecture does not support the forced\n"
+	"mode in all configurations. (y/n/force)");
 
 static inline bool mhp_memmap_on_memory(void)
 {
-	return memmap_on_memory;
+	return memmap_mode != MEMMAP_ON_MEMORY_DISABLE;
 }
 #else
 static inline bool mhp_memmap_on_memory(void)
@@ -1266,7 +1340,7 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
 {
 	unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT;
 	unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
-	unsigned long remaining_size = size - vmemmap_size;
+	unsigned long memmap_on_memory_size = memory_block_memmap_on_memory_size();
 
 	/*
 	 * Besides having arch support and the feature enabled at runtime, we
@@ -1294,10 +1368,28 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
 	 * altmap as an alternative source of memory, and we do not exactly
 	 * populate a single PMD.
 	 */
-	return mhp_memmap_on_memory() &&
-	       size == memory_block_size_bytes() &&
-	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)) &&
-	       arch_supports_memmap_on_memory(size);
+	if (!mhp_memmap_on_memory() || size != memory_block_size_bytes())
+		return false;
+
+	/*
+	 * Make sure the vmemmap allocation is fully contained
+	 * so that we always allocate vmemmap memory from altmap area.
+	 */
+	if (!IS_ALIGNED(vmemmap_size, PAGE_SIZE))
+		return false;
+
+	/*
+	 * start pfn should be pageblock_nr_pages aligned for correctly
+	 * setting migrate types
+	 */
+	if (!IS_ALIGNED(memmap_on_memory_size, PFN_PHYS(pageblock_nr_pages)))
+		return false;
+
+	if (memmap_on_memory_size == memory_block_size_bytes())
+		/* No effective hotplugged memory doesn't make sense. */
+		return false;
+
+	return arch_supports_memmap_on_memory(size);
 }
 
 /*
@@ -1310,7 +1402,10 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 {
 	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
 	enum memblock_flags memblock_flags = MEMBLOCK_NONE;
-	struct vmem_altmap mhp_altmap = {};
+	struct vmem_altmap mhp_altmap = {
+		.base_pfn = PHYS_PFN(res->start),
+		.end_pfn = PHYS_PFN(res->end),
+	};
 	struct memory_group *group = NULL;
 	u64 start, size;
 	bool new_node = false;
@@ -1355,8 +1450,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 	 */
 	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
 		if (mhp_supports_memmap_on_memory(size)) {
-			mhp_altmap.free = PHYS_PFN(size);
-			mhp_altmap.base_pfn = PHYS_PFN(start);
+			mhp_altmap.free = PHYS_PFN(memory_block_memmap_on_memory_size());
 			params.altmap = &mhp_altmap;
 		}
 		/* fallback to not using altmap */
@@ -1368,8 +1462,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 		goto error;
 
 	/* create memory block devices after memory was added */
-	ret = create_memory_block_devices(start, size, mhp_altmap.alloc,
-					  group);
+	ret = create_memory_block_devices(start, size, mhp_altmap.free, group);
 	if (ret) {
 		arch_remove_memory(start, size, NULL);
 		goto error;
@@ -2090,7 +2183,8 @@ static int __ref try_remove_memory(u64 start, u64 size)
 			 * right thing if we used vmem_altmap when hot-adding
 			 * the range.
 			 */
-			mhp_altmap.alloc = nr_vmemmap_pages;
+			mhp_altmap.base_pfn = PHYS_PFN(start);
+			mhp_altmap.free = nr_vmemmap_pages;
 			altmap = &mhp_altmap;
 		}
 	}
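
For anyone double-checking the changelog numbers, here is a tiny standalone userspace sketch of the "force" mode sizing arithmetic. This is not kernel code; the 64-byte struct page and 2MiB pageblock values are assumptions chosen to match the 64K page size example above:

/* Standalone sketch of the memmap_on_memory "force" sizing arithmetic.
 * Assumptions (not taken from the kernel): sizeof(struct page) == 64
 * and a 2MiB pageblock, matching the 64K page size example above.
 */
#include <stdio.h>

#define PAGE_SIZE        (64UL * 1024)              /* assumed base page size: 64K */
#define STRUCT_PAGE_SIZE 64UL                       /* assumed sizeof(struct page) */
#define PAGEBLOCK_BYTES  (2UL * 1024 * 1024)        /* assumed pageblock size: 2MiB */

static unsigned long align_up(unsigned long x, unsigned long a)
{
	return (x + a - 1) / a * a;
}

int main(void)
{
	unsigned long block_size = 256UL * 1024 * 1024;             /* memory block: 256MiB */
	unsigned long nr_pages = block_size / PAGE_SIZE;            /* 4096 pages */
	unsigned long memmap_size = nr_pages * STRUCT_PAGE_SIZE;    /* 256KiB -> 4 pages */
	/* "force" mode: align the memmap size up to full pageblocks */
	unsigned long on_memory_size = align_up(memmap_size, PAGEBLOCK_BYTES);

	printf("memmap pages      : %lu\n", memmap_size / PAGE_SIZE);               /* 4 */
	printf("set-aside pages   : %lu\n", on_memory_size / PAGE_SIZE);            /* 32 */
	printf("reserved for align: %lu\n", (on_memory_size - memmap_size) / PAGE_SIZE); /* 28 */
	return 0;
}

With those assumptions, each 256MiB block sets aside one full pageblock (32 pages of 64K): 4 pages hold the memmap and the remaining 28 are the alignment reserve, which matches the "for every 4096 pages, 28 pages get reserved" figure in the changelog.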