On 4/29/21 8:13 PM, Muchun Song wrote:
> When we free a HugeTLB page to the buddy allocator, we need to allocate
> the vmemmap pages associated with it. However, we may not be able to
> allocate the vmemmap pages when the system is under memory pressure. In
> this case, we just refuse to free the HugeTLB page. This changes behavior
> in some corner cases as listed below:
>
> 1) Failing to free a huge page triggered by the user (decrease nr_pages).
>
>    User needs to try again later.
>
> 2) Failing to free a surplus huge page when freed by the application.
>
>    Try again later when freeing a huge page next time.
>
> 3) Failing to dissolve a free huge page on ZONE_MOVABLE via
>    offline_pages().
>
>    This can happen when we have plenty of ZONE_MOVABLE memory, but
>    not enough kernel memory to allocate vmemmap pages. We may even
>    be able to migrate huge page contents, but will not be able to
>    dissolve the source huge page. This will prevent an offline
>    operation and is unfortunate as memory offlining is expected to
>    succeed on movable zones. Users that depend on memory hotplug
>    to succeed for movable zones should carefully consider whether the
>    memory savings gained from this feature are worth the risk of
>    possibly not being able to offline memory in certain situations.
>
> 4) Failing to dissolve a huge page on CMA/ZONE_MOVABLE via
>    alloc_contig_range() - once we have that handling in place. Mainly
>    affects CMA and virtio-mem.
>
>    Similar to 3). virtio-mem will handle migration errors gracefully.
>    CMA might be able to fall back on other free areas within the CMA
>    region.
>
> Vmemmap pages are allocated from the page freeing context. In order for
> those allocations not to be disruptive (e.g. trigger the OOM killer),
> __GFP_NORETRY is used. hugetlb_lock is dropped for the allocation
> because a non-sleeping allocation would be too fragile and it could fail
> too easily under memory pressure. GFP_ATOMIC or other modes to access
> memory reserves are not used because we want to prevent consuming
> reserves under heavy hugetlb freeing.
>
> Signed-off-by: Muchun Song <songmuchun@xxxxxxxxxxxxx>
> ---
>  Documentation/admin-guide/mm/hugetlbpage.rst    |  8 ++
>  Documentation/admin-guide/mm/memory-hotplug.rst | 13 ++++
>  include/linux/hugetlb.h                         |  3 +
>  include/linux/mm.h                              |  2 +
>  mm/hugetlb.c                                    | 98 +++++++++++++++++++++----
>  mm/hugetlb_vmemmap.c                            | 34 +++++++++
>  mm/hugetlb_vmemmap.h                            |  6 ++
>  mm/migrate.c                                    |  5 +-
>  mm/sparse-vmemmap.c                             | 75 ++++++++++++++++++-
>  9 files changed, 227 insertions(+), 17 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
> index f7b1c7462991..6988895d09a8 100644
> --- a/Documentation/admin-guide/mm/hugetlbpage.rst
> +++ b/Documentation/admin-guide/mm/hugetlbpage.rst
> @@ -60,6 +60,10 @@ HugePages_Surp
>          the pool above the value in ``/proc/sys/vm/nr_hugepages``. The
>          maximum number of surplus huge pages is controlled by
>          ``/proc/sys/vm/nr_overcommit_hugepages``.
> +        Note: When the feature of freeing unused vmemmap pages associated
> +        with each hugetlb page is enabled, the number of surplus huge pages
> +        may be temporarily larger than the maximum number of surplus huge
> +        pages when the system is under memory pressure.
>  Hugepagesize
>          is the default hugepage size (in Kb).
>  Hugetlb
> @@ -80,6 +84,10 @@ returned to the huge page pool when freed by a task. A user with root
>  privileges can dynamically allocate more or free some persistent huge pages
>  by increasing or decreasing the value of ``nr_hugepages``.
>
> +Note: When the feature of freeing unused vmemmap pages associated with each
> +hugetlb page is enabled, we can fail to free the huge pages triggered by
> +the user when the system is under memory pressure. Please try again later.
> +
>  Pages that are used as huge pages are reserved inside the kernel and cannot
>  be used for other purposes. Huge pages cannot be swapped out under
>  memory pressure.
> diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst
> index 05d51d2d8beb..c6bae2d77160 100644
> --- a/Documentation/admin-guide/mm/memory-hotplug.rst
> +++ b/Documentation/admin-guide/mm/memory-hotplug.rst
> @@ -357,6 +357,19 @@ creates ZONE_MOVABLE as following.
>  Unfortunately, there is no information to show which memory block belongs
>  to ZONE_MOVABLE. This is TBD.
>
> + Memory offlining can fail when dissolving a free huge page on ZONE_MOVABLE
> + and the feature of freeing unused vmemmap pages associated with each hugetlb
> + page is enabled.
> +
> + This can happen when we have plenty of ZONE_MOVABLE memory, but not enough
> + kernel memory to allocate vmemmap pages. We may even be able to migrate
> + huge page contents, but will not be able to dissolve the source huge page.
> + This will prevent an offline operation and is unfortunate as memory offlining
> + is expected to succeed on movable zones. Users that depend on memory hotplug
> + to succeed for movable zones should carefully consider whether the memory
> + savings gained from this feature are worth the risk of possibly not being
> + able to offline memory in certain situations.
> +
>  .. note::
>     Techniques that rely on long-term pinnings of memory (especially, RDMA and
>     vfio) are fundamentally problematic with ZONE_MOVABLE and, therefore, memory
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index d523a345dc86..d3abaaec2a22 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -525,6 +525,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>   * code knows it has only reference. All other examinations and
>   * modifications require hugetlb_lock.
>   * HPG_freed - Set when page is on the free lists.
> + * HPG_vmemmap_optimized - Set when the vmemmap pages of the page are freed.
>   * Synchronization: hugetlb_lock held for examination and modification.

You just moved the Synchronization comment so that it applies to both
HPG_freed and HPG_vmemmap_optimized. However, HPG_vmemmap_optimized is
checked/modified both with and without hugetlb_lock. Nothing wrong with
that, just need to update/fix the comment; an untested sketch of one
possible wording is at the end of this mail.

Everything else looks good to me,

Reviewed-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
--
Mike Kravetz
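
A rough, untested sketch of one way that comment block could read after such
an update; keeping the existing Synchronization line attached to HPG_freed and
giving HPG_vmemmap_optimized its own note is only a suggestion, and the exact
wording below has not been checked against the rest of the series:

 * HPG_freed - Set when page is on the free lists.
 *     Synchronization: hugetlb_lock held for examination and modification.
 * HPG_vmemmap_optimized - Set when the vmemmap pages of the page are freed.
 *     Synchronization: checked and modified both with and without
 *     hugetlb_lock held.

The idea is simply to document, next to the new flag, how it is actually
accessed instead of implying that hugetlb_lock covers it.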