On 25.06.19 09:52, Oscar Salvador wrote:
> Hi,
>
> It has been a while since I sent the previous version [1].
>
> In this version I added some feedback I got back then, like letting
> the caller decide whether he wants allocating per memory block or
> per memory range (patch#2), and having the chance to disable vmemmap when
> users want to expose all hotpluggable memory to userspace (patch#5).
>
> [Testing]
>
> While I could test last version on powerpc, and Huawei's fellows helped me out
> testing it on arm64, this time I could only test it on x86_64.
> The codebase is quite the same, so I would not expect surprises.
>
>  - x86_64: small and large memblocks (128MB, 1G and 2G)
>  - Kernel module that adds memory spanning multiple memblocks
>    and remove that memory in a different granularity.
>
> So far, only acpi memory hotplug uses the new flag.
> The other callers can be changed depending on their needs.
>
> Of course, more testing and feedback is appreciated.
>
> [Coverletter]
>
> This is another step to make memory hotplug more usable. The primary
> goal of this patchset is to reduce memory overhead of the hot-added
> memory (at least for SPARSEMEM_VMEMMAP memory model). The current way we use
> to populate memmap (struct page array) has two main drawbacks:

Mental note: How will it be handled if a caller specifies "Allocate
memmap from hotadded memory", but we are running under SPARSEMEM where
we can't do this?

>
> a) it consumes an additional memory until the hotadded memory itself is
>    onlined and
> b) memmap might end up on a different numa node which is especially true
>    for movable_node configuration.
>
> a) it is a problem especially for memory hotplug based memory "ballooning"
>    solutions when the delay between physical memory hotplug and the
>    onlining can lead to OOM and that led to introduction of hacks like auto
>    onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining
>    policy for the newly added memory")).
>
> b) can have performance drawbacks.
>
> Another minor case is that I have seen hot-add operations failing on archs
> because they were running out of order-x pages.
> E.g On powerpc, in certain configurations, we use order-8 pages,
> and given 64KB base pagesize, that is 16MB.
> If we run out of those, we just fail the operation and we cannot add
> more memory.

At least for SPARSEMEM, we fall back to vmalloc() to work around this
issue. I haven't looked into the populate_section_memmap() internals
yet. Can you point me at the code that performs this allocation?

> We could fallback to base pages as x86_64 does, but we can do better.
>
> One way to mitigate all these issues is to simply allocate memmap array
> (which is the largest memory footprint of the physical memory hotplug)
> from the hot-added memory itself. SPARSEMEM_VMEMMAP memory model allows
> us to map any pfn range so the memory doesn't need to be online to be
> usable for the array. See patch 3 for more details.
> This feature is only usable when CONFIG_SPARSEMEM_VMEMMAP is set.
>
> [Overall design]:
>
> Implementation wise we reuse vmem_altmap infrastructure to override
> the default allocator used by vmemmap_populate. Once the memmap is
> allocated we need a way to mark altmap pfns used for the allocation.
> If MHP_MEMMAP_{DEVICE,MEMBLOCK} flag was passed, we set up the layout of the
> altmap structure at the beginning of __add_pages(), and then we call
> mark_vmemmap_pages().
>
> The flags are either MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK, and only differ
> in the way they allocate vmemmap pages within the memory blocks.
>
> MHP_MEMMAP_MEMBLOCK:
>  - With this flag, we will allocate vmemmap pages in each memory block.
>    This means that if we hot-add a range that spans multiple memory blocks,
>    we will use the beginning of each memory block for the vmemmap pages.
>    This strategy is good for cases where the caller wants the flexibility
>    to hot-remove memory in a different granularity than when it was added.
>
> MHP_MEMMAP_DEVICE:
>  - With this flag, we will store all vmemmap pages at the beginning of
>    hot-added memory.
>
> So it is a tradeoff of flexibility vs contiguous memory.
> More info on the above can be found in patch#2.
>
> Depending on which flag is passed (MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK),
> mark_vmemmap_pages() gets called at a different stage.
> With MHP_MEMMAP_MEMBLOCK, we call it once we have populated the sections
> fitting in a single memblock, while with MHP_MEMMAP_DEVICE we wait until all
> sections have been populated.
>
> mark_vmemmap_pages() marks the pages as vmemmap and sets some metadata:
>
> The current layout of the Vmemmap pages are:
>
>         [Head->refcount] : Nr sections used by this altmap
>         [Head->private]  : Nr of vmemmap pages
>         [Tail->freelist] : Pointer to the head page
>
> This is done to ease the computation we need in some places.
> E.g:
>
> Example 1)
> We hot-add 1GB on x86_64 (memory block 128MB) using
> MHP_MEMMAP_DEVICE:
>
> head->_refcount = 8 sections
> head->private = 4096 vmemmap pages
> tail's->freelist = head
>
> Example 2)
> We hot-add 1GB on x86_64 using MHP_MEMMAP_MEMBLOCK:
>
> [at the beginning of each memblock]
> head->_refcount = 1 section
> head->private = 512 vmemmap pages
> tail's->freelist = head
>
> We have the refcount because when using MHP_MEMMAP_DEVICE, we need to know
> how much do we have to defer the call to vmemmap_free().
> The thing is that the first pages of the hot-added range are used to create
> the memmap mapping, so we cannot remove those first, otherwise we would blow up
> when accessing the other pages.

So, assuming we add_memory(1GB, MHP_MEMMAP_DEVICE) and then
remove_memory(128MB) of the added memory, this will work?

add_memory(8GB, MHP_MEMMAP_DEVICE)

For 8GB, we will need exactly 128MB of memmap if I did the math right.
So exactly one section. This section will still be marked as being
online (although no pages on it are actually online)?
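To double-check the numbers above, here is a minimal userspace C sketch of the
arithmetic. It assumes 4KB base pages, a 64-byte struct page and 128MB
sections (typical for x86_64, but not stated in the cover letter); the macro
and helper names are made up for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions, not taken from the patches: typical x86_64 values. */
#define BASE_PAGE_SIZE   4096ULL        /* 4KB base pages */
#define STRUCT_PAGE_SIZE 64ULL          /* sizeof(struct page) */
#define SECTION_SIZE     (128ULL << 20) /* 128MB memory sections */

/* Bytes of memmap (struct page array) needed to describe range_bytes. */
uint64_t memmap_bytes(uint64_t range_bytes)
{
	return (range_bytes / BASE_PAGE_SIZE) * STRUCT_PAGE_SIZE;
}

/* Number of base pages occupied by that memmap, rounded up. */
uint64_t vmemmap_pages(uint64_t range_bytes)
{
	return (memmap_bytes(range_bytes) + BASE_PAGE_SIZE - 1) / BASE_PAGE_SIZE;
}

/* Number of sections spanned by a section-aligned range. */
uint64_t nr_sections(uint64_t range_bytes)
{
	return range_bytes / SECTION_SIZE;
}

/* Re-derive the figures quoted in the cover letter and in my reply. */
void check_examples(void)
{
	/* Example 1: 1GB with MHP_MEMMAP_DEVICE */
	assert(nr_sections(1ULL << 30) == 8);
	assert(vmemmap_pages(1ULL << 30) == 4096);

	/* Example 2: one 128MB memblock with MHP_MEMMAP_MEMBLOCK */
	assert(nr_sections(128ULL << 20) == 1);
	assert(vmemmap_pages(128ULL << 20) == 512);

	/* 8GB needs 128MB of memmap, i.e. exactly one section */
	assert(memmap_bytes(8ULL << 30) == SECTION_SIZE);

	/* powerpc: order-8 allocation with 64KB base pages is 16MB */
	assert(((64ULL << 10) << 8) == (16ULL << 20));
}
```

Under these assumptions the figures in both examples and the 8GB case come
out as stated; with a different struct page size or base page size they would
of course change.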
>
> What we do is that since when we hot-remove a memory-range, sections are being
> removed sequentially, we wait until we hit the last section, and then we free
> the whole range to vmemmap_free backwards.
> We know that it is the last section because in every pass we
> decrease head->_refcount, and when it reaches 0, we got our last section.
>
> We also have to be careful about those pages during online and offline
> operations. They are simply skipped, so online will keep them
> reserved and so unusable for any other purpose and offline ignores them
> so they do not block the offline operation.

I assume that they will still be dumped normally by user space. (as they
are described by a "memory resource" and not PG_Offline)

>
> One thing worth mention is that vmemmap pages residing in movable memory is not a
> show-stopper for that memory to be offlined/migrated away.
> Vmemmap pages are just ignored in that case and they stick around until sections
> referred by those vmemmap pages are hot-removed.
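To make sure I read the deferred vmemmap_free() scheme correctly, here is a
small userspace C model of just the counting. The struct and function names
are hypothetical stand-ins (this is not the kernel's struct page, and not code
from the patches); it only mirrors the "decrement head->_refcount per removed
section, free when it reaches 0" description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the head vmemmap page of one altmap. */
struct vmemmap_head {
	unsigned long refcount;         /* sections still using this altmap */
	unsigned long nr_vmemmap_pages; /* head->private in the cover letter */
	bool freed;                     /* stands in for vmemmap_free() having run */
};

/*
 * Model the removal of one section. Returns true only if this call
 * dropped the last reference and therefore released the vmemmap pages.
 */
bool section_removed(struct vmemmap_head *head)
{
	assert(head->refcount > 0);
	if (--head->refcount > 0)
		return false; /* other sections still need the memmap */

	head->freed = true;   /* last section: free the whole range */
	return true;
}

/* Remove up to nr sections; returns how many calls freed the vmemmap. */
unsigned int remove_sections(struct vmemmap_head *head, unsigned int nr)
{
	unsigned int i, frees = 0;

	for (i = 0; i < nr && head->refcount > 0; i++)
		if (section_removed(head))
			frees++;
	return frees;
}
```

In this model, starting from refcount = 8 (the 1GB MHP_MEMMAP_DEVICE example)
and removing a single 128MB section leaves refcount at 7 with the vmemmap
pages still in place; only removing the remaining seven sections releases
them. Whether the real code copes with the first section going away first is
exactly what I am asking about above.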
>
> [1] https://patchwork.kernel.org/cover/10875017/
>
> Oscar Salvador (5):
>   drivers/base/memory: Remove unneeded check in
>     remove_memory_block_devices
>   mm,memory_hotplug: Introduce MHP_VMEMMAP_FLAGS
>   mm,memory_hotplug: Introduce Vmemmap page helpers
>   mm,memory_hotplug: allocate memmap from the added memory range for
>     sparse-vmemmap
>   mm,memory_hotplug: Allow userspace to enable/disable vmemmap
>
>  arch/arm64/mm/mmu.c            |   5 +-
>  arch/powerpc/mm/init_64.c      |   7 ++
>  arch/s390/mm/init.c            |   6 ++
>  arch/x86/mm/init_64.c          |  10 +++
>  drivers/acpi/acpi_memhotplug.c |   2 +-
>  drivers/base/memory.c          |  41 +++++++++--
>  drivers/dax/kmem.c             |   2 +-
>  drivers/hv/hv_balloon.c        |   2 +-
>  drivers/s390/char/sclp_cmd.c   |   2 +-
>  drivers/xen/balloon.c          |   2 +-
>  include/linux/memory_hotplug.h |  31 ++++++++-
>  include/linux/memremap.h       |   2 +-
>  include/linux/page-flags.h     |  34 +++++++++
>  mm/compaction.c                |   7 ++
>  mm/memory_hotplug.c            | 152 ++++++++++++++++++++++++++++++++++-------
>  mm/page_alloc.c                |  22 +++++-
>  mm/page_isolation.c            |  14 +++-
>  mm/sparse.c                    |  93 +++++++++++++++++++++++++
>  mm/util.c                      |   2 +
>  19 files changed, 394 insertions(+), 42 deletions(-)
>

Thanks for doing this, this will be very helpful :)

-- 

Thanks,

David / dhildenb