On 16.04.21 12:21, Oscar Salvador wrote:
Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section. Currently, alloc_pages_node() is used
for those allocations.
This has some disadvantages:
a) existing memory is consumed for that purpose
(e.g., ~2MB per 128MB memory section on x86_64)
b) if the whole node is movable then we have off-node struct pages,
which have performance drawbacks.
c) It might be that there are no PMD_ALIGNED chunks, so the memmap
array gets populated with base pages.
This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
Vmemmap page tables can map arbitrary memory.
That means that we can simply use the beginning of each memory section and
map struct pages there.
struct pages which back the allocated space then just need to be treated
carefully.
Implementation-wise, we will reuse the vmem_altmap infrastructure to override
the default allocator used by __populate_section_memmap.
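For illustration, the hot-add side could be wired up roughly as follows
(a sketch, not the verbatim patch; the MHP_MEMMAP_ON_MEMORY flag and the
exact plumbing in add_memory_resource() are assumptions based on this
series):

        struct mhp_params params = { .pgprot = PAGE_KERNEL };
        struct vmem_altmap mhp_altmap = {};

        if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
                /* Back the memmap with the hot-added range itself. */
                mhp_altmap.base_pfn = PHYS_PFN(start);
                mhp_altmap.free = PHYS_PFN(size);
                params.altmap = &mhp_altmap;
        }
        /*
         * __populate_section_memmap() then allocates struct pages from
         * the altmap instead of calling the default allocator.
         */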
Part of the implementation also relies on the memory_block structure gaining
a new field which specifies the number of vmemmap_pages at the beginning.
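I.e., something like this in include/linux/memory.h (sketch):

        struct memory_block {
                ...
                /* Number of vmemmap pages at the start of the block. */
                unsigned long nr_vmemmap_pages;
        };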
This patch also introduces the following functions:
- mhp_init_memmap_on_memory:
Initializes vmemmap pages by calling move_pfn_range_to_zone(),
calls kasan_add_zero_shadow(), and onlines as many sections
as vmemmap pages fully span.
- mhp_deinit_memmap_on_memory:
Undoes what mhp_init_memmap_on_memory did.
The new function memory_block_online() calls mhp_init_memmap_on_memory() before
doing the actual online_pages(). Should online_pages() fail, we clean up
by calling mhp_deinit_memmap_on_memory().
Adjusting present_pages is done at the end, once we know that
online_pages() succeeded.
On offline, memory_block_offline() needs to unaccount vmemmap pages from
present_pages before calling offline_pages().
This is necessary because offline_pages() tears down some structures
depending on whether the node or the zone becomes empty.
If offline_pages() fails, we account the vmemmap pages back.
If it succeeds, we call mhp_deinit_memmap_on_memory().
Hot-remove:
We need to be careful when removing memory, as adding and
removing memory needs to be done with the same granularity.
To check that this assumption is not violated, we check the
memory range we want to remove and if a) any memory block has
vmemmap pages and b) the range spans more than a single memory
block, we scream out loud and refuse to proceed.
If all is good and the range was using memmap on memory (aka vmemmap pages),
we construct an altmap structure so that free_hugepage_table() does the
right thing and calls vmem_altmap_free() instead of free_pagetable().
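Roughly, the removal path does something like this (a sketch; the
get_nr_vmemmap_pages_cb() helper name is an assumption):

        struct vmem_altmap mhp_altmap = {};
        struct vmem_altmap *altmap = NULL;
        unsigned long nr_vmemmap_pages;

        /* Does any memory block in the range self-host its memmap? */
        nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
                                              get_nr_vmemmap_pages_cb);
        if (nr_vmemmap_pages) {
                /* Removal granularity must match the add granularity. */
                if (size != memory_block_size_bytes())
                        return -EINVAL;
                /*
                 * Mark the vmemmap pages as allocated from the altmap,
                 * so free_hugepage_table() calls vmem_altmap_free().
                 */
                mhp_altmap.alloc = nr_vmemmap_pages;
                altmap = &mhp_altmap;
        }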
Signed-off-by: Oscar Salvador <osalvador@xxxxxxx>
---
drivers/base/memory.c | 75 ++++++++++++++++--
include/linux/memory.h | 8 +-
include/linux/memory_hotplug.h | 17 +++-
include/linux/memremap.h | 2 +-
include/linux/mmzone.h | 7 +-
mm/Kconfig | 5 ++
mm/memory_hotplug.c | 171 ++++++++++++++++++++++++++++++++++++++---
mm/sparse.c | 2 -
8 files changed, 265 insertions(+), 22 deletions(-)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index f209925a5d4e..179857d53982 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -173,16 +173,76 @@ static int memory_block_online(struct memory_block *mem)
{
unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+ unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
+ struct zone *zone;
+ int ret;
+
+ zone = mhp_get_target_zone(start_pfn, nr_pages, mem->nid,
+ mem->online_type);
+
+ /*
+ * Although vmemmap pages have a different lifecycle than the pages
+ * they describe (they remain until the memory is unplugged), doing
+ * its initialization and accounting at hot-{online,offline} stage
s/its/their/
s|hot-{online,offline} stage|memory onlining/offlining stage|
+ * simplifies things a lot
+ */
+ if (nr_vmemmap_pages) {
+ ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone);
+ if (ret)
+ return ret;
+ }
+
+ ret = online_pages(start_pfn + nr_vmemmap_pages,
+ nr_pages - nr_vmemmap_pages, zone);
+ if (ret) {
+ if (nr_vmemmap_pages)
+ mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
+ return ret;
+ }
+
+ /*
+ * Account once onlining succeeded. If the page was unpopulated, it is
s/page/zone/
+ * now already properly populated.
+ */
+ if (nr_vmemmap_pages)
+ adjust_present_page_count(zone, nr_vmemmap_pages);
- return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid);
+ return ret;
}
static int memory_block_offline(struct memory_block *mem)
{
unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+ unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
+ struct zone *zone;
+ int ret;
+
+ zone = page_zone(pfn_to_page(start_pfn));
- return offline_pages(start_pfn, nr_pages);
+ /*
+ * Unaccount before offlining, such that unpopulated zone and kthreads
+ * can properly be torn down in offline_pages().
+ */
+ if (nr_vmemmap_pages)
+ adjust_present_page_count(zone, -nr_vmemmap_pages);
+
+ ret = offline_pages(start_pfn + nr_vmemmap_pages,
+ nr_pages - nr_vmemmap_pages);
+ if (ret) {
+ /* offline_pages() failed. Account back. */
+ if (nr_vmemmap_pages)
+ adjust_present_page_count(zone, nr_vmemmap_pages);
+ return ret;
+ }
+
+ /*
+ * Re-adjust present pages if offline_pages() fails.
+ */
That comment is stale. I'd just drop it.
+ if (nr_vmemmap_pages)
+ mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
+
+ return ret;
}
[...]
-static void adjust_present_page_count(struct zone *zone, long nr_pages)
+/*
+ * This function should only be called by memory_block_{online,offline},
+ * and {online,offline}_pages.
+ */
+void adjust_present_page_count(struct zone *zone, long nr_pages)
{
unsigned long flags;
@@ -839,12 +850,64 @@ static void adjust_present_page_count(struct zone *zone, long nr_pages)
pgdat_resize_unlock(zone->zone_pgdat, &flags);
}
-int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
- int online_type, int nid)
+struct zone *mhp_get_target_zone(unsigned long pfn, unsigned long nr_pages,
+ int nid, int online_type)
+{
+ return zone_for_pfn_range(online_type, nid, pfn, nr_pages);
+}
+
Oh, you can just use zone_for_pfn_range() directly for now. No need for
mhp_get_target_zone(). Sorry for not realizing this.
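I.e., in memory_block_online():

        zone = zone_for_pfn_range(mem->online_type, mem->nid, start_pfn,
                                  nr_pages);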
+int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
+ struct zone *zone)
+{
+ unsigned long end_pfn = pfn + nr_pages;
+ int ret;
+
+ /*
+ * Initialize vmemmap pages with the corresponding node, zone links set.
+ */
+ move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
+
+ ret = kasan_add_zero_shadow(__va(PFN_PHYS(pfn)), PFN_PHYS(nr_pages));
+ if (ret) {
+ remove_pfn_range_from_zone(zone, pfn, nr_pages);
+ return ret;
+ }
IIRC, we have to add the zero shadow first, before touching the memory.
This is also what mm/memremap.c does.
In mhp_deinit_memmap_on_memory(), you already remove in the proper
(reversed) order :)
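I.e., something like:

        ret = kasan_add_zero_shadow(__va(PFN_PHYS(pfn)), PFN_PHYS(nr_pages));
        if (ret)
                return ret;

        /* Initialize vmemmap pages with node/zone links set. */
        move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);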
+
+int __ref online_pages(unsigned long pfn, unsigned long nr_pages, struct zone *zone)
{
unsigned long flags;
- struct zone *zone;
int need_zonelists_rebuild = 0;
+ int nid;
int ret;
struct memory_notify arg;
@@ -860,8 +923,9 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
mem_hotplug_begin();
+ nid = zone_to_nid(zone);
I'd do that right above
const int nid = zone_to_nid(zone);
[...]
--
Thanks,
David / dhildenb