The patch titled
     Subject: hugetlb-add-hugetlb-demote-page-support-v4
has been added to the -mm tree.  Its filename is
     hugetlb-add-hugetlb-demote-page-support-v4.patch

This patch should soon appear at
    https://ozlabs.org/~akpm/mmots/broken-out/hugetlb-add-hugetlb-demote-page-support-v4.patch
and later at
    https://ozlabs.org/~akpm/mmotm/broken-out/hugetlb-add-hugetlb-demote-page-support-v4.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
Subject: hugetlb-add-hugetlb-demote-page-support-v4

On 10/7/21 11:19 AM, Mike Kravetz wrote:
> +static int demote_free_huge_page(struct hstate *h, struct page *page)
> +{
> +	int i, nid = page_to_nid(page);
> +	struct hstate *target_hstate;
> +	int rc = 0;
> +
> +	target_hstate = size_to_hstate(PAGE_SIZE << h->demote_order);
> +
> +	remove_hugetlb_page_for_demote(h, page, false);
> +	spin_unlock_irq(&hugetlb_lock);
> +
> +	rc = alloc_huge_page_vmemmap(h, page);
> +	if (rc) {
> +		/* Allocation of vmemmmap failed, we can not demote page */
> +		spin_lock_irq(&hugetlb_lock);
> +		set_page_refcounted(page);
> +		add_hugetlb_page(h, page, false);
> +		return rc;
> +	}
> +
> +	/*
> +	 * Use destroy_compound_hugetlb_page_for_demote for all huge page
> +	 * sizes as it will not ref count pages.
> +	 */
> +	destroy_compound_hugetlb_page_for_demote(page, huge_page_order(h));
> +
> +	for (i = 0; i < pages_per_huge_page(h);
> +				i += pages_per_huge_page(target_hstate)) {
> +		if (hstate_is_gigantic(target_hstate))
> +			prep_compound_gigantic_page_for_demote(page + i,
> +							target_hstate->order);
> +		else
> +			prep_compound_page(page + i, target_hstate->order);
> +		set_page_private(page + i, 0);
> +		set_page_refcounted(page + i);
> +		prep_new_huge_page(target_hstate, page + i, nid);
> +		put_page(page + i);
> +	}

I was doing some stress testing with multiple concurrent writers to
sysfs/.../nr_hugepages and sysfs/.../demote.  On occasion, I would see
unexpected surplus pages of the smaller huge page size (2M on x86).
Here is what was happening.

One task was decrementing the number of 2M huge pages with
"echo 0 > nr_hugepages".  It proceeded to the routine set_max_huge_pages
and was executing the following:

	/*
	 * Decrease the pool size
	 * First return free pages to the buddy allocator (being careful
	 * to keep enough around to satisfy reservations).  Then place
	 * pages into surplus state as needed so the pool will shrink
	 * to the desired size as pages become free.
	 *
	 * By placing pages into the surplus state independent of the
	 * overcommit value, we are allowing the surplus pool size to
	 * exceed overcommit. There are few sane options here. Since
	 * alloc_surplus_huge_page() is checking the global counter,
	 * though, we'll note that we're not allowed to exceed surplus
	 * and won't grow the pool anywhere else. Not until one of the
	 * sysctls are changed, or the surplus pages go out of use.
	 */
	min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
	min_count = max(count, min_count);
	try_to_free_low(h, min_count, nodes_allowed);

	/*
	 * Collect pages to be removed on list without dropping lock
	 */
	while (min_count < persistent_huge_pages(h)) {
		page = remove_pool_huge_page(h, nodes_allowed, 0);
		if (!page)
			break;

		list_add(&page->lru, &page_list);
	}
	/* free the pages after dropping lock */
	spin_unlock_irq(&hugetlb_lock);
	update_and_free_pages_bulk(h, &page_list);
	flush_free_hpage_work(h);

Now, while the lock was dropped, the routine demote_free_huge_page above
added 512 huge pages to the 2M pool.

	spin_lock_irq(&hugetlb_lock);

Then, after reacquiring the lock, we make these 512 pages surplus.

	while (count < persistent_huge_pages(h)) {
		if (!adjust_pool_surplus(h, nodes_allowed, 1))
			break;
	}

To prevent this race from happening in general, the hstate specific mutex
resize_lock is held for the duration of set_max_huge_pages.  Since the
demote code also adjusts pool sizes, it should take the mutex as well.
The routine demote_store already takes the mutex of the hstate of the page
size being demoted (1G in this case).  That is because the 1G pool size
will be decreased.  We also need to take the resize mutex of the 2M pool,
as this pool will be increased.  To prevent deadlocks, we use the
convention of always taking the resize mutex of the larger hstate first.

An updated version of this patch below adds taking the 'target hstate'
mutex in demote_free_huge_page.  Although unnecessary, it also updates
max_huge_pages of both hstates for consistency.

Demote page functionality will split a huge page into a number of huge
pages of a smaller size.  For example, on x86 a 1GB huge page can be
demoted into 512 2M huge pages.  Demotion is done 'in place' by simply
splitting the huge page.

Added '*_for_demote' wrappers for remove_hugetlb_page,
destroy_compound_hugetlb_page and prep_compound_gigantic_page for use
by demote code.
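
The locking convention described above can be illustrated outside the
kernel.  What follows is only a userspace sketch under stated assumptions,
not the kernel code: 'struct pool', pages_per_demote() and demote_one() are
hypothetical names, a pthread mutex stands in for the hstate resize_lock,
and the orders (18 for 1G, 9 for 2M) assume x86-64 with 4K base pages.  It
shows the larger pool's lock being taken before the smaller pool's lock,
and the two pool counts changing together while both locks are held.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical userspace stand-in for struct hstate; not the kernel type. */
struct pool {
	unsigned int order;		/* log2 of base pages per huge page */
	unsigned long nr_pages;		/* persistent huge pages in this pool */
	pthread_mutex_t resize_lock;	/* models hstate->resize_lock */
};

/* Number of dst-sized pages produced by splitting one src-sized page. */
static unsigned long pages_per_demote(const struct pool *src,
				      const struct pool *dst)
{
	assert(src->order > dst->order);
	return 1UL << (src->order - dst->order);
}

/*
 * Demote one page from src to dst.  The larger pool's lock is always
 * taken first, so two paths that need both locks cannot deadlock.
 * Returns 0 on success, -1 if src is not larger than dst or is empty.
 */
static int demote_one(struct pool *src, struct pool *dst)
{
	int rc = -1;

	if (src->order <= dst->order)
		return rc;

	pthread_mutex_lock(&src->resize_lock);	/* larger size first */
	pthread_mutex_lock(&dst->resize_lock);	/* then the target size */
	if (src->nr_pages > 0) {
		src->nr_pages--;
		dst->nr_pages += pages_per_demote(src, dst);
		rc = 0;
	}
	pthread_mutex_unlock(&dst->resize_lock);
	pthread_mutex_unlock(&src->resize_lock);
	return rc;
}
```

In this model, a resize path that holds dst->resize_lock for its whole
duration can never observe the 512 new pages appearing halfway through its
work, which is the property the target hstate mutex is meant to provide.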

Link: https://lkml.kernel.org/r/6ca29b8e-527c-d6ec-900e-e6a43e4f8b73@xxxxxxxxxx
Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@xxxxxxxxxxxxx>
Cc: David Hildenbrand <david@xxxxxxxxxx>
Cc: David Rientjes <rientjes@xxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxx>
Cc: Muchun Song <songmuchun@xxxxxxxxxxxxx>
Cc: Naoya Horiguchi <naoya.horiguchi@xxxxxxxxx>
Cc: Nghia Le <nghialm78@xxxxxxxxx>
Cc: Oscar Salvador <osalvador@xxxxxxx>
Cc: Zi Yan <ziy@xxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/hugetlb.c |   18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

--- a/mm/hugetlb.c~hugetlb-add-hugetlb-demote-page-support-v4
+++ a/mm/hugetlb.c
@@ -3349,6 +3349,15 @@ static int demote_free_huge_page(struct
 	 */
 	destroy_compound_hugetlb_page_for_demote(page, huge_page_order(h));
 
+	/*
+	 * Taking target hstate mutex synchronizes with set_max_huge_pages.
+	 * Without the mutex, pages added to target hstate could be marked
+	 * as surplus.
+	 *
+	 * Note that we already hold h->resize_lock.  To prevent deadlock,
+	 * use the convention of always taking larger size hstate mutex first.
+	 */
+	mutex_lock(&target_hstate->resize_lock);
 	for (i = 0; i < pages_per_huge_page(h);
 				i += pages_per_huge_page(target_hstate)) {
 		if (hstate_is_gigantic(target_hstate))
@@ -3361,8 +3370,17 @@ static int demote_free_huge_page(struct
 		prep_new_huge_page(target_hstate, page + i, nid);
 		put_page(page + i);
 	}
+	mutex_unlock(&target_hstate->resize_lock);
 
 	spin_lock_irq(&hugetlb_lock);
+
+	/*
+	 * Not absolutely necessary, but for consistency update max_huge_pages
+	 * based on pool changes for the demoted page.
+	 */
+	h->max_huge_pages--;
+	target_hstate->max_huge_pages += pages_per_huge_page(h);
+
 	return rc;
 }
 
_

Patches currently in -mm which might be from mike.kravetz@xxxxxxxxxx are

hugetlb-add-demote-hugetlb-page-sysfs-interfaces.patch
mm-cma-add-cma_pages_valid-to-determine-if-pages-are-in-cma.patch
hugetlb-be-sure-to-free-demoted-cma-pages-to-cma.patch
hugetlb-add-demote-bool-to-gigantic-page-routines.patch
hugetlb-add-hugetlb-demote-page-support.patch
hugetlb-add-hugetlb-demote-page-support-v4.patch