> On Aug 11, 2022, at 06:38, Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
> 
> On 08/10/22 14:20, Muchun Song wrote:
>>> On Aug 9, 2022, at 05:28, Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
>>> 
>>> When creating hugetlb pages, the hugetlb code must first allocate
>>> contiguous pages from a low level allocator such as buddy, cma or
>>> memblock. The pages returned from these low level allocators are
>>> ref counted. This creates potential issues with other code taking
>>> speculative references on these pages before they can be transformed
>>> to a hugetlb page. This issue has been addressed with methods and
>>> code such as that provided in [1].
>>> 
>>> Recent discussions about vmemmap freeing [2] have indicated that it
>>> would be beneficial to freeze all sub pages, including the head page,
>>> of pages returned from low level allocators before converting them to
>>> a hugetlb page. This helps avoid races if we want to replace the page
>>> containing vmemmap for the head page.
>>> 
>>> There have been proposals to change at least the buddy allocator to
>>> return frozen pages as described at [3]. If such a change is made, it
>>> can be employed by the hugetlb code. However, as mentioned above,
>>> hugetlb uses several low level allocators, so each would need to be
>>> modified to return frozen pages. For now, we can manually freeze the
>>> returned pages. This is done in two places:
>>> 1) alloc_buddy_huge_page, where only the returned head page is ref
>>>    counted. We freeze the head page, retrying once in the VERY rare
>>>    case where there may be an inflated ref count.
>>> 2) prep_compound_gigantic_page, where for gigantic pages the current
>>>    code freezes all pages except the head page. New code will simply
>>>    freeze the head page as well.
>>> 
>>> In a few other places, code checks for inflated ref counts on newly
>>> allocated hugetlb pages. With the modifications to freeze after
>>> allocating, this code can be removed.
>>> 
>>> After hugetlb pages are freshly allocated, they are often added to
>>> the hugetlb free lists. Since these pages were previously ref
>>> counted, this was done via put_page(), which would end up calling the
>>> hugetlb destructor: free_huge_page. With the changes to freeze pages,
>>> we simply call free_huge_page directly to add the pages to the free
>>> list.
>>> 
>>> In a few other places, freshly allocated hugetlb pages were
>>> immediately put into use, and the expectation was that they were
>>> already ref counted. In these cases, we must manually ref count the
>>> page.
>>> 
>>> [1] https://lore.kernel.org/linux-mm/20210622021423.154662-3-mike.kravetz@xxxxxxxxxx/
>>> [2] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@xxxxxxxxxx/
>>> [3] https://lore.kernel.org/linux-mm/20220531150611.1303156-1-willy@xxxxxxxxxxxxx/
>>> 
>>> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
>>> ---
>>>  mm/hugetlb.c | 97 +++++++++++++++++++---------------------------------
>>>  1 file changed, 35 insertions(+), 62 deletions(-)
>>> 
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 28516881a1b2..6b90d85d545b 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -1769,13 +1769,12 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order,
>>>  {
>>>  	int i, j;
>>>  	int nr_pages = 1 << order;
>>> -	struct page *p = page + 1;
>>> +	struct page *p = page;
>>>  
>>>  	/* we rely on prep_new_huge_page to set the destructor */
>>>  	set_compound_order(page, order);
>>> -	__ClearPageReserved(page);
>>>  	__SetPageHead(page);
>>> -	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
>>> +	for (i = 0; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
>>>  		/*
>>>  		 * For gigantic hugepages allocated through bootmem at
>>>  		 * boot, it's safer to be consistent with the not-gigantic
>>> @@ -1814,7 +1813,8 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order,
>>>  		} else {
>>>  			VM_BUG_ON_PAGE(page_count(p), p);
>>>  		}
>>> -		set_compound_head(p, page);
>>> +		if (i != 0)
>>> +			set_compound_head(p, page);
>>>  	}
>>>  	atomic_set(compound_mapcount_ptr(page), -1);
>>>  	atomic_set(compound_pincount_ptr(page), 0);
>>> @@ -1918,6 +1918,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
>>>  	int order = huge_page_order(h);
>>>  	struct page *page;
>>>  	bool alloc_try_hard = true;
>>> +	bool retry = true;
>>>  
>>>  	/*
>>>  	 * By default we always try hard to allocate the page with
>>> @@ -1933,7 +1934,21 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
>>>  	gfp_mask |= __GFP_RETRY_MAYFAIL;
>>>  	if (nid == NUMA_NO_NODE)
>>>  		nid = numa_mem_id();
>>> +retry:
>>>  	page = __alloc_pages(gfp_mask, order, nid, nmask);
>>> +
>>> +	/* Freeze head page */
>>> +	if (!page_ref_freeze(page, 1)) {
>> 
>> Hi Mike,
>> 
>> I saw Matthew has introduced a new helper alloc_frozen_pages() in
>> thread [1] to allocate a frozen page. Then we would not need to handle
>> the unexpected refcount case, which should be easy. Is there any
>> consideration why we do not choose alloc_frozen_pages()?
> 
> I asked Matthew about these efforts before creating this patch.  At the
> time, there were issues with the first version of his patch series and
> he wasn't sure when he would get around to looking into those issues.
> 
> I then decided to proceed with manually freezing pages after allocation.
> The thought was that alloc_frozen_pages() could be added when it became
> available.  The 'downstream changes' to deal with pages that have zero
> ref count should remain the same.  IMO, these downstream changes are the
> more important parts of this patch.
> 
> Shortly after sending this patch, Matthew took another look at his
> series and discovered the source of the issues.  He then sent this v2
> series.  Matthew will correct me if this is not accurate.

Thanks Mike, it is really clear to me.

> 
>> 
>> [1] https://lore.kernel.org/linux-mm/20220809171854.3725722-15-willy@xxxxxxxxxxxxx/T/#u
>> 
> 
> I am happy to wait until Matthew's series moves forward, and then use
> the new interface.

I agree. Let's wait together.

Muchun

> 
> -- 
> Mike Kravetz
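
[Editor's sketch of the freeze-after-allocation pattern the commit message
describes for alloc_buddy_huge_page(). The quoted diff above is trimmed
mid-hunk, so the helper name alloc_frozen_buddy_page() and the exact error
handling below are illustrative assumptions, not the actual patch.]

/*
 * Illustrative sketch only: freeze the head page right after allocation.
 * __alloc_pages() returns the head page with a ref count of 1;
 * page_ref_freeze(page, 1) drops it to 0 unless someone has taken a
 * speculative reference in the meantime, in which case the page is freed
 * and the allocation retried once.
 */
static struct page *alloc_frozen_buddy_page(gfp_t gfp_mask, unsigned int order,
					    int nid, nodemask_t *nmask)
{
	struct page *page;
	bool retry = true;

retry:
	page = __alloc_pages(gfp_mask, order, nid, nmask);
	if (!page)
		return NULL;

	if (!page_ref_freeze(page, 1)) {
		/* VERY rare: a speculative reference inflated the count. */
		__free_pages(page, order);
		if (retry) {
			retry = false;
			goto retry;
		}
		return NULL;
	}

	return page;	/* ref count is now zero; the page is "frozen" */
}

With this pattern, a later switch to an allocator that returns frozen pages
(such as the alloc_frozen_pages() series referenced above) would only need to
replace the allocation and drop the retry, while the downstream handling of
zero-refcount pages stays the same, which matches the point made in the reply.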