Re: [PATCH V6 3/3] arm64/mm: Enable memory hot remove

Mark Rutland <mark.rutland@xxxxxxx> · Mon, 24 Jun 2019 17:52:01 +0100

On Fri, Jun 21, 2019 at 03:35:53PM +0100, Steve Capper wrote:
> Hi Anshuman,
> 
> On Wed, Jun 19, 2019 at 09:47:40AM +0530, Anshuman Khandual wrote:
> > The arch code for hot-remove must tear down portions of the linear map and
> > vmemmap corresponding to memory being removed. In both cases the page
> > tables mapping these regions must be freed, and when sparse vmemmap is in
> > use the memory backing the vmemmap must also be freed.
> > 
> > This patch adds a new remove_pagetable() helper which can be used to tear
> > down either region, and calls it from vmemmap_free() and
> > ___remove_pgd_mapping(). The sparse_vmap argument determines whether the
> > backing memory will be freed.
> > 
> > remove_pagetable() makes two distinct passes over the kernel page table.
> > In the first pass it unmaps, invalidates applicable TLB cache and frees
> > backing memory if required (vmemmap) for each mapped leaf entry. In the
> > second pass it looks for empty page table sections whose page table page
> > can be unmapped, TLB invalidated and freed.
> > 
> > While freeing intermediate level page table pages bail out if any of its
> > entries are still valid. This can happen for partially filled kernel page
> > table either from a previously attempted failed memory hot add or while
> > removing an address range which does not span the entire page table page
> > range.
> > 
> > The vmemmap region may share levels of table with the vmalloc region.
> > There can be conflicts between hot remove freeing page table pages with
> > a concurrent vmalloc() walking the kernel page table. This conflict can
> > not just be solved by taking the init_mm ptl because of existing locking
> > scheme in vmalloc(). Hence unlike linear mapping, skip freeing page table
> > pages while tearing down vmemmap mapping.
> > 
> > While here update arch_add_memory() to handle __add_pages() failures by
> > just unmapping recently added kernel linear mapping. Now enable memory hot
> > remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.
> > 
> > This implementation is overall inspired from kernel page table tear down
> > procedure on X86 architecture.
> > 
> > Acked-by: David Hildenbrand <david@xxxxxxxxxx>
> > Signed-off-by: Anshuman Khandual <anshuman.khandual@xxxxxxx>
> > ---
> 
> FWIW:
> Acked-by: Steve Capper <steve.capper@xxxxxxx>
> 
> One minor comment below though.
> 
> >  arch/arm64/Kconfig  |   3 +
> >  arch/arm64/mm/mmu.c | 290 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> >  2 files changed, 284 insertions(+), 9 deletions(-)
> > 
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 6426f48..9375f26 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -270,6 +270,9 @@ config HAVE_GENERIC_GUP
> >  config ARCH_ENABLE_MEMORY_HOTPLUG
> >  	def_bool y
> >  
> > +config ARCH_ENABLE_MEMORY_HOTREMOVE
> > +	def_bool y
> > +
> >  config SMP
> >  	def_bool y
> >  
> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index 93ed0df..9e80a94 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -733,6 +733,250 @@ int kern_addr_valid(unsigned long addr)
> >  
> >  	return pfn_valid(pte_pfn(pte));
> >  }
> > +
> > +#ifdef CONFIG_MEMORY_HOTPLUG
> > +static void free_hotplug_page_range(struct page *page, size_t size)
> > +{
> > +	WARN_ON(!page || PageReserved(page));
> > +	free_pages((unsigned long)page_address(page), get_order(size));
> > +}
> 
> We are dealing with power of 2 number of pages, it makes a lot more
> sense (to me) to replace the size parameter with order.
> 
> Also, all the callers are for known compile-time sizes, so we could just
> translate the size parameter as follows to remove any usage of get_order?
> PAGE_SIZE -> 0
> PMD_SIZE -> PMD_SHIFT - PAGE_SHIFT
> PUD_SIZE -> PUD_SHIFT - PAGE_SHIFT

Now that I look at this again, the above makes sense to me.

I'd requested the current form (which I now realise is broken), since
back in v2 the code looked like:

static void free_pagetable(struct page *page, int order)
{
	...
	free_pages((unsigned long)page_address(page), order);
	...
}

... with callsites looking like:

free_pagetable(pud_page(*pud), get_order(PUD_SIZE));

... which I now see is off by PAGE_SHIFT, and we inherited that bug in
the current code, so the calculated order is vastly larger than it
should be. It's worrying that doesn't seem to be caught by anything in
testing. :/

Anshuman, could you please fold in Steve's suggested change? I'll look
at the rest of the series shortly, so no need to resend that right away,
but it would be worth sorting out.

Thanks,
Mark.