On 28.03.19 16:09, David Hildenbrand wrote:
> On 28.03.19 14:43, Oscar Salvador wrote:
>> Hi,
>>
>> since the last two RFCs went almost unnoticed (thanks David for the feedback),
>> I decided to re-work some parts to make it simpler, give it more
>> testing, and drop the RFC, to see if it gets more attention.
>> I also added David's feedback, so now all users of add_memory/__add_memory/
>> add_memory_resource can specify whether they want to use this feature or not.
>
> Terrific, I will also definitely try to make use of that in the next
> virtio-mem prototype (looks like I'll finally have time to look into it
> again).
>
>> I also fixed some compilation issues when CONFIG_SPARSEMEM_VMEMMAP is not set.
>>
>> [Testing]
>>
>> Testing has been carried out on the following platforms:
>>
>>  - x86_64 (small and big memblocks)
>>  - powerpc
>>  - arm64 (Huawei's fellows)
>>
>> I plan to test it on Xen and Hyper-V, but for now those two will not be
>> using this feature, and neither will DAX/pmem.
>
> I think doing it step by step is the right approach. Less likely to
> break stuff.
>
>>
>> Of course, if this does not find any strong objection, my next step is to
>> work on enabling this on Xen/Hyper-V.
>>
>> [Coverletter]
>>
>> This is another step to make the memory hotplug more usable. The primary
>> goal of this patchset is to reduce memory overhead of the hot added
>> memory (at least for SPARSE_VMEMMAP memory model). The current way we
>> populate memmap (struct page array) has two main drawbacks:
>>
>> a) it consumes additional memory until the hotadded memory itself is
>>    onlined and
>> b) memmap might end up on a different numa node which is especially true
>>    for movable_node configuration.
>>
>> a) is a problem especially for memory hotplug based memory "ballooning"
>>    solutions when the delay between physical memory hotplug and the
>>    onlining can lead to OOM and that led to introduction of hacks like auto
>>    onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining
>>    policy for the newly added memory")).
>>
>> b) can have performance drawbacks.
>>
>> I have also seen hot-add operations failing on archs because they
>> were running out of order-x pages.
>> E.g. on powerpc, in certain configurations, we use order-8 pages,
>> and given 64KB base pagesize, that is 16MB.
>> If we run out of those, we just fail the operation and we cannot add
>> more memory.
>> We could fall back to base pages as x86_64 does, but we can do better.
>>
>> One way to mitigate all these issues is to simply allocate memmap array
>> (which is the largest memory footprint of the physical memory hotplug)
>> from the hotadded memory itself. VMEMMAP memory model allows us to map
>> any pfn range so the memory doesn't need to be online to be usable
>> for the array. See patch 3 for more details. In short I am reusing an
>> existing vmem_altmap which wants to achieve the same thing for nvdimm
>> device memory.
>>
>> There is also one potential drawback, though. If somebody uses memory
>> hotplug for 1G (gigantic) hugetlb pages then this scheme will not work
>> for them obviously because each memory block will contain reserved
>> area. Large x86 machines will use 2G memblocks so at least one 1G page
>> will be available but this is still not 2G...
>>
>> If that is a problem, we can always configure a fallback strategy to
>> use the current scheme.
>>
>> Since this only works when CONFIG_VMEMMAP_ENABLED is set,
>> we do check for it before setting the flag that allows us
>> to use the feature, no matter if the user wanted it.
>>
>> [Overall design]:
>>
>> Let us say we hot-add 2GB of memory on a x86_64 (memblock size = 128M).
>> That is:
>>
>>  - 16 sections
>>  - 524288 pages
>>  - 8192 vmemmap pages (out of those 524288. We spend 512 pages for each section)
>>
>> The range of pages is: 0xffffea0004000000 - 0xffffea0006000000
>> The vmemmap range is:  0xffffea0004000000 - 0xffffea0004080000
>>
>> 0xffffea0004000000 is the head vmemmap page (first page), while all the others
>> are "tails".
>>
>> We keep the following information in it:
>>
>>  - Head page:
>>    - head->_refcount: number of sections
>>    - head->private  : number of vmemmap pages
>>  - Tail page:
>>    - tail->freelist : pointer to the head
>>
>> This is done because it eases the work in cases where we have to compute the
>> number of vmemmap pages to know how much we have to skip etc, and to keep
>> the right accounting to present_pages.
>>
>> When we want to hot-remove the range, we need to be careful because the first
>> pages of that range are used for the memmap mapping, so if we remove those
>> first, we would blow up while accessing the others later on.
>> For that reason we keep the number of sections in head->_refcount, to know how
>> long we have to defer the free up.
>>
>> Since in a hot-remove operation sections are being removed sequentially, the
>> approach taken here is that every time we hit free_section_memmap(), we decrease
>> the refcount of the head.
>> When it reaches 0, we know that we hit the last section, so we call
>> vmemmap_free() for the whole memory-range backwards, so we make sure that
>> the pages used for the mapping will be the last to be freed up.
>>
>> Vmemmap pages are charged to spanned/present_pages, but not to managed_pages.
>>
>
> I guess one important thing to mention is that it is no longer possible
> to remove memory in a different granularity than it was added. I vaguely
> remember that ACPI code sometimes "reuses" parts of already added
> memory. We would have to validate that this can indeed not be an issue.
>
> drivers/acpi/acpi_memhotplug.c:
>
> result = __add_memory(node, info->start_addr, info->length);
> if (result && result != -EEXIST)
> 	continue;
>
> What would happen when removing this dimm (->remove_memory())?
>
>
> Also have a look at
>
> arch/powerpc/platforms/powernv/memtrace.c
>
> I consider it evil code. It will simply try to offline+unplug *some*
> memory it finds in *some granularity*. Not sure if this might be
> problematic.
>
> Would there be any "safety net" for adding/removing memory in different
> granularities?
>

Correct me if I am wrong. I think I was confused - vmemmap data is still
allocated *per memory block*, not for the whole added memory, correct?

-- 

Thanks,

David / dhildenb
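
As a sanity check on the numbers quoted from the cover letter, here is a
small userspace sketch (not kernel code) that re-derives them. It assumes
4 KiB base pages and a 64-byte struct page, neither of which is stated in
the thread; only the 128M memblock size and the powerpc figures come from
the cover letter. All helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions (not stated in the thread): 4 KiB base pages and a
 * 64-byte struct page. The 128 MiB section size is the memblock
 * size quoted in the cover letter. */
#define BASE_PAGE_SIZE    4096ULL
#define SECTION_SIZE      (128ULL << 20)
#define STRUCT_PAGE_SIZE  64ULL

/* Number of sections covered by a hot-added range of 'bytes' bytes. */
static uint64_t hotadd_sections(uint64_t bytes)
{
	assert(bytes % SECTION_SIZE == 0); /* sketch only handles whole sections */
	return bytes / SECTION_SIZE;
}

/* Number of base pages in a hot-added range. */
static uint64_t hotadd_pages(uint64_t bytes)
{
	return bytes / BASE_PAGE_SIZE;
}

/* Base pages needed to hold the struct page array of one section. */
static uint64_t vmemmap_pages_per_section(void)
{
	uint64_t memmap_bytes = (SECTION_SIZE / BASE_PAGE_SIZE) * STRUCT_PAGE_SIZE;

	/* Round up to whole pages. */
	return (memmap_bytes + BASE_PAGE_SIZE - 1) / BASE_PAGE_SIZE;
}

/* Base pages taken from the hot-added range for its own memmap. */
static uint64_t hotadd_vmemmap_pages(uint64_t bytes)
{
	return hotadd_sections(bytes) * vmemmap_pages_per_section();
}

/* Size in bytes of one order-'order' allocation for a given base page size. */
static uint64_t order_size_bytes(unsigned int order, uint64_t base_page_size)
{
	return base_page_size << order;
}
```

Under these assumptions, hotadd_sections(2ULL << 30) is 16,
hotadd_pages(2ULL << 30) is 524288, vmemmap_pages_per_section() is 512 and
hotadd_vmemmap_pages(2ULL << 30) is 8192, which matches the cover letter.
order_size_bytes(8, 64ULL << 10) gives the 16MB order-8 allocation
mentioned for powerpc.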