On Wed, May 04, 2022 at 03:12:39PM -0700, Mike Kravetz wrote:
> On 4/29/22 05:18, Muchun Song wrote:
> > We must add hugetlb_free_vmemmap=on (or "off") to the boot cmdline and
> > reboot the server to enable or disable the feature of optimizing vmemmap
> > pages associated with HugeTLB pages. However, rebooting usually takes a
> > long time. So add a sysctl to enable or disable the feature at runtime
> > without rebooting. Why do we need this? There are 3 use cases.
> >
> > 1) The feature of minimizing the overhead of struct page associated with
> > each HugeTLB page is disabled by default unless "hugetlb_free_vmemmap=on"
> > is passed to the boot cmdline. When we (ByteDance) deliver the servers to
> > users who want to enable this feature, they have to configure grub
> > (change the boot cmdline) and reboot the servers, whereas rebooting
> > usually takes a long time (we have thousands of servers). It's a very
> > bad experience for the users. So we need an approach to enable this
> > feature without rebooting. This is a use case in our practical
> > environment.
> >
> > 2) In some use cases, HugeTLB pages are allocated 'on the fly' instead
> > of being pulled from the HugeTLB pool; those workloads would be affected
> > with this feature enabled. Such workloads can be identified by the
> > characteristic that they never explicitly allocate huge pages with
> > 'nr_hugepages' but only set 'nr_overcommit_hugepages' and then let the
> > pages be allocated from the buddy allocator at fault time. We can
> > confirm it is a real use case from commit 099730d67417. For those
> > workloads, the page fault time could be ~2x slower than before. We
> > suspect those users want to disable this feature if the system has
> > enabled it before and they don't think the memory savings benefit is
> > enough to make up for the performance drop.
> >
> > 3) A workload which wants vmemmap pages to be optimized and a workload
> > which sets 'nr_overcommit_hugepages' and does not want the extra
> > overhead at fault time (when the overcommitted pages are allocated from
> > the buddy allocator) may be deployed on the same server. The user could
> > enable this feature, set 'nr_hugepages' and 'nr_overcommit_hugepages',
> > and then disable the feature. In this case, the overcommitted HugeTLB
> > pages will not encounter the extra overhead at fault time.
> >
> > Signed-off-by: Muchun Song <songmuchun@xxxxxxxxxxxxx>
> > ---
> >  Documentation/admin-guide/sysctl/vm.rst | 30 +++++++++++
> >  include/linux/memory_hotplug.h          |  9 ++++
> >  mm/hugetlb_vmemmap.c                    | 92 +++++++++++++++++++++++++++++----
> >  mm/hugetlb_vmemmap.h                    |  4 +-
> >  mm/memory_hotplug.c                     |  7 +--
> >  5 files changed, 126 insertions(+), 16 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> > index 747e325ebcd0..00434789cf26 100644
> > --- a/Documentation/admin-guide/sysctl/vm.rst
> > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > @@ -562,6 +562,36 @@ Change the minimum size of the hugepage pool.
> >  See Documentation/admin-guide/mm/hugetlbpage.rst
> >
> >
> > +hugetlb_optimize_vmemmap
> > +========================
> > +
> > +Enable (set to 1) or disable (set to 0) the feature of optimizing vmemmap pages
> > +associated with each HugeTLB page.
>
> Should we mention that "memory_hotplug.memmap_on_memory" or an unusual system
> config that results in struct page not being a power of 2 will prevent
> enabling?
>

Good question.
I think it is better to clarify that this knob exists only when
memory_hotplug.memmap_on_memory (kernel parameter) is disabled and the size of
"struct page" is a power of 2 (only an unusual system config results in it not
being a power of 2).

> > +
> > +Once enabled, the vmemmap pages of subsequent allocation of HugeTLB pages from
> > +buddy allocator will be optimized (7 pages per 2MB HugeTLB page and 4095 pages
> > +per 1GB HugeTLB page), whereas already allocated HugeTLB pages will not be
> > +optimized. When those optimized HugeTLB pages are freed from the HugeTLB pool
> > +to the buddy allocator, the vmemmap pages representing that range need to be
> > +remapped again and the vmemmap pages discarded earlier need to be reallocated
> > +again. If your use case is that HugeTLB pages are allocated 'on the fly' (e.g.
> > +never explicitly allocating HugeTLB pages with 'nr_hugepages' but only setting
> > +'nr_overcommit_hugepages', so those overcommitted HugeTLB pages are allocated 'on
> > +the fly') instead of being pulled from the HugeTLB pool, you should weigh the
> > +benefit of memory savings against the extra overhead (~2x slower than before)
> > +of allocating or freeing HugeTLB pages between the HugeTLB pool and the buddy
> > +allocator. Another behavior to note is that if the system is under heavy memory
> > +pressure, it could prevent the user from freeing HugeTLB pages from the HugeTLB
> > +pool to the buddy allocator since the allocation of vmemmap pages could
> > +fail; you have to retry later if your system encounters this situation.
> > +
> > +Once disabled, the vmemmap pages of subsequent allocation of HugeTLB pages from
> > +buddy allocator will not be optimized, meaning the extra overhead at allocation
> > +time from buddy allocator disappears, whereas already optimized HugeTLB pages
> > +will not be affected. If you want to make sure there are no optimized HugeTLB
> > +pages, you can set "nr_hugepages" to 0 first and then disable this.
>
> Thank you for adding this documentation.
>
> We may want to clarify that last statement about making sure there are no
> optimized HugeTLB pages. Writing 0 to nr_hugepages will make any "in use"
> HugeTLB pages become surplus pages. So, those surplus pages are still
> optimized until they are no longer in use. You would need to wait for
> those surplus pages to be released before there are no optimized pages in
> the system.
>

I didn't realize this. I'll add it to the documentation. Thanks.
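Coming back to the availability point above, a rough sketch of the kind of
gate I have in mind (purely illustrative, not the actual patch;
hugetlb_optimize_vmemmap_allowed() is a made-up name, mhp_memmap_on_memory()
is the helper this patch adds, and is_power_of_2() comes from <linux/log2.h>):

	#include <linux/log2.h>			/* is_power_of_2() */
	#include <linux/memory_hotplug.h>	/* mhp_memmap_on_memory() */
	#include <linux/mm_types.h>		/* struct page */

	/* Hypothetical helper, only to illustrate the two conditions. */
	static bool hugetlb_optimize_vmemmap_allowed(void)
	{
		/* An unusual config can make "struct page" a non-power-of-2 size. */
		if (!is_power_of_2(sizeof(struct page)))
			return false;

		/*
		 * The knob should not be available when
		 * memory_hotplug.memmap_on_memory is in use (see above).
		 */
		if (mhp_memmap_on_memory())
			return false;

		return true;
	}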
> > +
> > +
> >  nr_hugepages_mempolicy
> >  ======================
> >
> > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> > index 029fb7e26504..917112661b5c 100644
> > --- a/include/linux/memory_hotplug.h
> > +++ b/include/linux/memory_hotplug.h
> > @@ -351,4 +351,13 @@ void arch_remove_linear_mapping(u64 start, u64 size);
> >  extern bool mhp_supports_memmap_on_memory(unsigned long size);
> >  #endif /* CONFIG_MEMORY_HOTPLUG */
> >
> > +#ifdef CONFIG_MHP_MEMMAP_ON_MEMORY
> > +bool mhp_memmap_on_memory(void);
> > +#else
> > +static inline bool mhp_memmap_on_memory(void)
> > +{
> > +	return false;
> > +}
> > +#endif
> > +
> >  #endif /* __LINUX_MEMORY_HOTPLUG_H */
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index cc4ec752ec16..5820a681a724 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -10,6 +10,7 @@
> >   */
> >  #define pr_fmt(fmt) "HugeTLB: " fmt
> >
> > +#include <linux/memory_hotplug.h>
> >  #include "hugetlb_vmemmap.h"
> >
> >  /*
> > @@ -22,21 +23,40 @@
> >  #define RESERVE_VMEMMAP_NR	1U
> >  #define RESERVE_VMEMMAP_SIZE	(RESERVE_VMEMMAP_NR << PAGE_SHIFT)
> >
> > +enum vmemmap_optimize_mode {
> > +	VMEMMAP_OPTIMIZE_OFF,
> > +	VMEMMAP_OPTIMIZE_ON,
> > +};
> > +
> >  DEFINE_STATIC_KEY_MAYBE(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON,
> >  			hugetlb_optimize_vmemmap_key);
> >  EXPORT_SYMBOL(hugetlb_optimize_vmemmap_key);
> >
> > +static enum vmemmap_optimize_mode vmemmap_optimize_mode =
> > +	IS_ENABLED(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON);
> > +
> > +static void vmemmap_optimize_mode_switch(enum vmemmap_optimize_mode to)
> > +{
> > +	if (vmemmap_optimize_mode == to)
> > +		return;
> > +
> > +	if (to == VMEMMAP_OPTIMIZE_OFF)
> > +		static_branch_dec(&hugetlb_optimize_vmemmap_key);
> > +	else
> > +		static_branch_inc(&hugetlb_optimize_vmemmap_key);
> > +	vmemmap_optimize_mode = to;
> > +}
> > +
> >  static int __init hugetlb_vmemmap_early_param(char *buf)
> >  {
> >  	bool enable;
> > +	enum vmemmap_optimize_mode mode;
> >
> >  	if (kstrtobool(buf, &enable))
> >  		return -EINVAL;
> >
> > -	if (enable)
> > -		static_branch_enable(&hugetlb_optimize_vmemmap_key);
> > -	else
> > -		static_branch_disable(&hugetlb_optimize_vmemmap_key);
> > +	mode = enable ? VMEMMAP_OPTIMIZE_ON : VMEMMAP_OPTIMIZE_OFF;
> > +	vmemmap_optimize_mode_switch(mode);
> >
> >  	return 0;
> >  }
> > @@ -60,6 +80,8 @@ int hugetlb_vmemmap_alloc(struct hstate *h, struct page *head)
> >  	vmemmap_end	= vmemmap_addr + (vmemmap_pages << PAGE_SHIFT);
> >  	vmemmap_reuse	= vmemmap_addr - PAGE_SIZE;
> >
> > +	VM_BUG_ON_PAGE(!vmemmap_pages, head);
> > +
> >  	/*
> >  	 * The pages which the vmemmap virtual address range [@vmemmap_addr,
> >  	 * @vmemmap_end) are mapped to are freed to the buddy allocator, and
> > @@ -69,8 +91,10 @@ int hugetlb_vmemmap_alloc(struct hstate *h, struct page *head)
> >  	 */
> >  	ret = vmemmap_remap_alloc(vmemmap_addr, vmemmap_end, vmemmap_reuse,
> >  				  GFP_KERNEL | __GFP_NORETRY | __GFP_THISNODE);
> > -	if (!ret)
> > +	if (!ret) {
> >  		ClearHPageVmemmapOptimized(head);
> > +		static_branch_dec(&hugetlb_optimize_vmemmap_key);
> > +	}
> >
> >  	return ret;
> >  }
> > @@ -84,6 +108,8 @@ void hugetlb_vmemmap_free(struct hstate *h, struct page *head)
> >  	if (!vmemmap_pages)
> >  		return;
> >
> > +	static_branch_inc(&hugetlb_optimize_vmemmap_key);
>
> Can you explain the reasoning behind doing the static_branch_inc here in free,
> and static_branch_dec in alloc?
> IIUC, they may not be absolutely necessary but you could use the count to
> know how many optimized pages are in use? Or, I may just be missing
> something.
>

Partly right. One 'count' is not enough. I implemented this with a similar
approach in v6 [1]. Besides the 'count', we also need a lock for
synchronization. However, both the count and the synchronization are already
included in the static_branch_inc/dec infrastructure, so it is simpler to use
static_branch_inc/dec directly, right? A rough sketch of the difference is
below.

[1] https://lore.kernel.org/all/20220330153745.20465-5-songmuchun@xxxxxxxxxxxxx/

Thanks.
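Purely illustrative sketch of the two options (this is not the actual v6
code; the lock and counter names below are made up):

	#include <linux/jump_label.h>
	#include <linux/spinlock.h>

	/*
	 * Open-coded alternative along the lines described above: an explicit
	 * count plus a lock for synchronization (hypothetical names).
	 */
	static DEFINE_SPINLOCK(hugetlb_optimize_lock);
	static unsigned long nr_optimized_hugepages;

	static void optimized_pages_inc(void)
	{
		spin_lock(&hugetlb_optimize_lock);
		nr_optimized_hugepages++;
		spin_unlock(&hugetlb_optimize_lock);
	}

	static void optimized_pages_dec(void)
	{
		spin_lock(&hugetlb_optimize_lock);
		nr_optimized_hugepages--;
		spin_unlock(&hugetlb_optimize_lock);
	}

	/*
	 * With the static key, the refcount and the synchronization already
	 * live inside the jump label infrastructure, which is why the patch
	 * can simply call static_branch_inc(&hugetlb_optimize_vmemmap_key)
	 * in hugetlb_vmemmap_free() and static_branch_dec() of the same key
	 * in hugetlb_vmemmap_alloc().
	 */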