On 4/29/21 8:13 PM, Muchun Song wrote:
> When we free a HugeTLB page to the buddy allocator, we need to allocate
> the vmemmap pages associated with it. However, we may not be able to
> allocate the vmemmap pages when the system is under memory pressure. In
> this case, we just refuse to free the HugeTLB page. This changes behavior
> in some corner cases as listed below:
>
> 1) Failing to free a huge page triggered by the user (decrease nr_pages).
>
>    User needs to try again later.
>
> 2) Failing to free a surplus huge page when freed by the application.
>
>    Try again later when freeing a huge page next time.
>
> 3) Failing to dissolve a free huge page on ZONE_MOVABLE via
>    offline_pages().
>
>    This can happen when we have plenty of ZONE_MOVABLE memory, but
>    not enough kernel memory to allocate vmemmap pages. We may even
>    be able to migrate huge page contents, but will not be able to
>    dissolve the source huge page. This will prevent an offline
>    operation and is unfortunate as memory offlining is expected to
>    succeed on movable zones. Users that depend on memory hotplug
>    to succeed for movable zones should carefully consider whether the
>    memory savings gained from this feature are worth the risk of
>    possibly not being able to offline memory in certain situations.
>
> 4) Failing to dissolve a huge page on CMA/ZONE_MOVABLE via
>    alloc_contig_range() - once we have that handling in place. Mainly
>    affects CMA and virtio-mem.
>
>    Similar to 3). virtio-mem will handle migration errors gracefully.
>    CMA might be able to fall back on other free areas within the CMA
>    region.
>
> Vmemmap pages are allocated from the page freeing context. In order for
> those allocations not to be disruptive (e.g. trigger the OOM killer),
> __GFP_NORETRY is used. hugetlb_lock is dropped for the allocation
> because a non-sleeping allocation would be too fragile and it could fail
> too easily under memory pressure. GFP_ATOMIC or other modes to access
> memory reserves are not used because we want to prevent consuming
> reserves under heavy hugetlb freeing.
>
> Signed-off-by: Muchun Song <songmuchun@xxxxxxxxxxxxx>
> ---
>  Documentation/admin-guide/mm/hugetlbpage.rst    |  8 ++
>  Documentation/admin-guide/mm/memory-hotplug.rst | 13 ++++
>  include/linux/hugetlb.h                         |  3 +
>  include/linux/mm.h                              |  2 +
>  mm/hugetlb.c                                    | 98 +++++++++++++++++++++----
>  mm/hugetlb_vmemmap.c                            | 34 +++++++++
>  mm/hugetlb_vmemmap.h                            |  6 ++
>  mm/migrate.c                                    |  5 +-
>  mm/sparse-vmemmap.c                             | 75 ++++++++++++++++++-
>  9 files changed, 227 insertions(+), 17 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
> index f7b1c7462991..6988895d09a8 100644
> --- a/Documentation/admin-guide/mm/hugetlbpage.rst
> +++ b/Documentation/admin-guide/mm/hugetlbpage.rst
> @@ -60,6 +60,10 @@ HugePages_Surp
>          the pool above the value in ``/proc/sys/vm/nr_hugepages``. The
>          maximum number of surplus huge pages is controlled by
>          ``/proc/sys/vm/nr_overcommit_hugepages``.
> +        Note: When the feature of freeing unused vmemmap pages associated
> +        with each hugetlb page is enabled, the number of surplus huge pages
> +        may be temporarily larger than the maximum number of surplus huge
> +        pages when the system is under memory pressure.
>  Hugepagesize
>          is the default hugepage size (in Kb).
>  Hugetlb
> @@ -80,6 +84,10 @@ returned to the huge page pool when freed by a task. A user with root
>  privileges can dynamically allocate more or free some persistent huge pages
>  by increasing or decreasing the value of ``nr_hugepages``.
>
> +Note: When the feature of freeing unused vmemmap pages associated with each
> +hugetlb page is enabled, we can fail to free the huge pages triggered by
> +the user when the system is under memory pressure. Please try again later.
> +
>  Pages that are used as huge pages are reserved inside the kernel and cannot
>  be used for other purposes. Huge pages cannot be swapped out under
>  memory pressure.
> diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst
> index 05d51d2d8beb..c6bae2d77160 100644
> --- a/Documentation/admin-guide/mm/memory-hotplug.rst
> +++ b/Documentation/admin-guide/mm/memory-hotplug.rst
> @@ -357,6 +357,19 @@ creates ZONE_MOVABLE as following.
>  Unfortunately, there is no information to show which memory block belongs
>  to ZONE_MOVABLE. This is TBD.
>
> + Memory offlining can fail when dissolving a free huge page on ZONE_MOVABLE
> + and the feature of freeing unused vmemmap pages associated with each hugetlb
> + page is enabled.
> +
> + This can happen when we have plenty of ZONE_MOVABLE memory, but not enough
> + kernel memory to allocate vmemmap pages. We may even be able to migrate
> + huge page contents, but will not be able to dissolve the source huge page.
> + This will prevent an offline operation and is unfortunate as memory offlining
> + is expected to succeed on movable zones. Users that depend on memory hotplug
> + to succeed for movable zones should carefully consider whether the memory
> + savings gained from this feature are worth the risk of possibly not being
> + able to offline memory in certain situations.
> +
>  .. note::
>     Techniques that rely on long-term pinnings of memory (especially, RDMA and
>     vfio) are fundamentally problematic with ZONE_MOVABLE and, therefore, memory
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index d523a345dc86..d3abaaec2a22 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -525,6 +525,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>   * code knows it has only reference. All other examinations and
>   * modifications require hugetlb_lock.
>   * HPG_freed - Set when page is on the free lists.
> + * HPG_vmemmap_optimized - Set when the vmemmap pages of the page are freed.
>   * Synchronization: hugetlb_lock held for examination and modification.

You just moved the Synchronization comment so that it applies to both
HPG_freed and HPG_vmemmap_optimized. However, HPG_vmemmap_optimized is
checked/modified both with and without hugetlb_lock. Nothing wrong with
that, just need to update/fix the comment; an untested sketch of one
possible wording is at the end of this mail.

Everything else looks good to me,

Reviewed-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
--
Mike Kravetz
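
A rough, untested sketch of one way that comment block could read after such
an update; keeping the existing Synchronization line attached to HPG_freed and
giving HPG_vmemmap_optimized its own note is only a suggestion, and the exact
wording below has not been checked against the rest of the series:

 * HPG_freed - Set when page is on the free lists.
 *     Synchronization: hugetlb_lock held for examination and modification.
 * HPG_vmemmap_optimized - Set when the vmemmap pages of the page are freed.
 *     Synchronization: checked and modified both with and without
 *     hugetlb_lock held.

The idea is simply to document, next to the new flag, how it is actually
accessed instead of implying that hugetlb_lock covers it.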