On 26.06.19 10:03, Oscar Salvador wrote:
> On Tue, Jun 25, 2019 at 10:25:48AM +0200, David Hildenbrand wrote:
>>> [Coverletter]
>>>
>>> This is another step to make memory hotplug more usable. The primary
>>> goal of this patchset is to reduce the memory overhead of hot-added
>>> memory (at least for the SPARSEMEM_VMEMMAP memory model). The way we
>>> currently populate the memmap (struct page array) has two main
>>> drawbacks:
>
> First off, thanks for looking into this :-)

Thanks for working on this ;)

>
>> Mental note: How will it be handled if a caller specifies "Allocate
>> memmap from hotadded memory", but we are running under SPARSEMEM where
>> we can't do this?
>
> In add_memory_resource(), we have a call to mhp_check_correct_flags(),
> which is in charge of checking, among other things, whether the passed
> flags are compliant with our configuration.
> It also checks if both flags were passed (_MEMBLOCK|_DEVICE).
>
> If a) any of the flags was specified and we are not on
> CONFIG_SPARSEMEM_VMEMMAP, b) the flags collide with each other or c) the
> flags just do not make sense, we print out a warning and drop the flags
> to 0, so we just ignore them.
>
> I just realized that I can tighten the check even more (something for
> the next version).
>
> But to answer your question: flags are ignored under
> !CONFIG_SPARSEMEM_VMEMMAP.

So it is indeed a hint only.

>
>>>
>>> a) it consumes additional memory until the hotadded memory itself is
>>>    onlined and
>>> b) the memmap might end up on a different NUMA node, which is
>>>    especially true for movable_node configurations.
>>>
>>> a) is a problem especially for memory hotplug based memory
>>>    "ballooning" solutions, where the delay between physical memory
>>>    hotplug and onlining can lead to OOM; that led to the introduction
>>>    of hacks like auto onlining (see 31bc3858ea3e ("memory-hotplug:
>>>    add automatic onlining policy for the newly added memory")).
>>>
>>> b) can have performance drawbacks.
>>>
>>> Another minor case is that I have seen hot-add operations fail on
>>> archs because they were running out of order-x pages.
>>> E.g. on powerpc, in certain configurations, we use order-8 pages, and
>>> given the 64KB base page size, that is 16MB.
>>> If we run out of those, we just fail the operation and cannot add
>>> more memory.
>>
>> At least for SPARSEMEM, we fall back to vmalloc() to work around this
>> issue. I haven't looked into the populate_section_memmap() internals
>> yet. Can you point me at the code that performs this allocation?
>
> Yes, on SPARSEMEM we first try to allocate the pages physically
> contiguous, and then fall back to vmalloc.
> This is because with the CONFIG_SPARSEMEM memory model, the
> pfn_to_page/page_to_pfn translations do not expect the memory to be
> contiguous.
>
> But that is not the case on CONFIG_SPARSEMEM_VMEMMAP.
> We do expect the memmap to be (virtually) contiguous there; that is why
> a simple pfn_to_page/page_to_pfn is a matter of adding/subtracting
> vmemmap/pfn.

Yes, I explored that last week but didn't figure out where the actual
vmemmap population code resided - thanks :)
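For reference, the VMEMMAP flavour of these helpers lives in
include/asm-generic/memory_model.h and really is just pointer arithmetic
on the global vmemmap base:

	/* memmap is virtually contiguous.  */
	#define __pfn_to_page(pfn)	(vmemmap + (pfn))
	#define __page_to_pfn(page)	(unsigned long)((page) - vmemmap)

So, strictly speaking, it is the memmap that has to stay contiguous in
the vmemmap area, independent of which physical pages end up backing it.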
>
> Powerpc code is at:
>
> https://elixir.bootlin.com/linux/v5.2-rc6/source/arch/powerpc/mm/init_64.c#L175
>
>> So, assuming we add_memory(1GB, MHP_MEMMAP_DEVICE) and then
>> remove_memory(128MB) of the added memory, this will work?
>
> No, MHP_MEMMAP_DEVICE is meant to be used when hot-adding and
> hot-removing work in the same granularity.
> This is because all memmap pages will be stored at the beginning of the
> memory range.
> Allowing hot-removing in a different granularity on MHP_MEMMAP_DEVICE
> would imply a lot of extra work.
> For example, we would have to parse the vmemmap-head of the hot-removed
> range, punch a hole in there to clear the vmemmap pages, and then be
> very careful when deleting those page tables.
>
> So I followed Michal's advice and decided to let the caller specify
> whether it wants the memmap allocated per memory block or per hot-added
> range (device).
> The per-memory-block mode allows us to do:
>
> add_memory(1GB, MHP_MEMMAP_MEMBLOCK)
> remove_memory(128MB)

Back then, I already mentioned that we might have some users that
remove_memory() chunks they never added themselves, in a granularity the
memory was not added in. My concerns back then were never fully sorted
out.

arch/powerpc/platforms/powernv/memtrace.c

- Will remove memory in memory-block-sized chunks it never added itself
- What if that memory resides on a DIMM added via MHP_MEMMAP_DEVICE?

Will it at least bail out? Or simply break?

IOW: I am not yet 100% convinced that MHP_MEMMAP_DEVICE is safe to be
introduced.
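To make that concern concrete, here is a minimal sketch (purely
hypothetical - none of these helpers exist in the series, the names are
made up) of the kind of check I would expect somewhere in the
remove_memory() path, so that memtrace-style users at least bail out
instead of corrupting the memmap:

	#include <linux/types.h>	/* u64, bool */
	#include <linux/errno.h>	/* EINVAL */

	/* Assumed helpers, not part of this series: */
	bool range_has_device_memmap(u64 start, u64 size);
	bool range_covers_whole_device(u64 start, u64 size);

	static int check_memmap_device_granularity(u64 start, u64 size)
	{
		/*
		 * A partial removal of a MHP_MEMMAP_DEVICE range would
		 * leave the memmap pages at the start of the device
		 * behind, still describing memory we just removed.
		 */
		if (range_has_device_memmap(start, size) &&
		    !range_covers_whole_device(start, size))
			return -EINVAL;
		return 0;
	}

If nothing along these lines is in place, I would expect memtrace to hit
exactly the "simply break" case.

-- 

Thanks,

David / dhildenb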