On 4/20/21 1:46 AM, Muchun Song wrote:
> On Tue, Apr 20, 2021 at 7:20 AM Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
>>
>> On 4/15/21 1:40 AM, Muchun Song wrote:
>>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>>> index 0abed7e766b8..6e970a7d3480 100644
>>> --- a/include/linux/hugetlb.h
>>> +++ b/include/linux/hugetlb.h
>>> @@ -525,6 +525,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>>>  * code knows it has only reference. All other examinations and
>>>  * modifications require hugetlb_lock.
>>>  * HPG_freed - Set when page is on the free lists.
>>> + * HPG_vmemmap_optimized - Set when the vmemmap pages of the page are freed.
>>>  * Synchronization: hugetlb_lock held for examination and modification.
>>
>> I like the per-page flag. In previous versions of the series, you just
>> checked free_vmemmap_pages_per_hpage() to determine if vmemmap
>> should be allocated. Is there any change in functionality that makes it
>> necessary to set the flag in each page, or is it mostly for flexibility
>> going forward?
>
> Actually, only the routine of dissolving the page cares whether
> the page is on the buddy free list when update_and_free_page
> returns. But we cannot change the return type of
> update_and_free_page (e.g. change the return type from 'void' to
> 'int'). Why? If the hugepage is freed through a kworker, we cannot
> know the return value when update_and_free_page returns. So adding
> a return value seems odd.
>
> In the dissolving routine, we can allocate the vmemmap pages first;
> if that succeeds, then we can be sure that update_and_free_page
> will successfully free the page. So I need some way to mark a page
> that does not need vmemmap pages allocated.
>
> On the surface, we seem to have a straightforward method
> to do this:
>
> Add a new parameter 'alloc_vmemmap' to update_and_free_page() to
> indicate that the caller has already allocated the vmemmap pages, so
> update_and_free_page() does not need to allocate them. Just like below.
>
> void update_and_free_page(struct hstate *h, struct page *page, bool atomic,
>                           bool alloc_vmemmap)
> {
>         if (alloc_vmemmap)
>                 // allocate vmemmap pages
> }
>
> But if the page is freed through a kworker, how do we pass
> 'alloc_vmemmap' to the kworker? We can embed this information in
> the per-page flag instead. So if we introduce HPG_vmemmap_optimized,
> the alloc_vmemmap parameter is no longer necessary.
>
> So it seems that introducing HPG_vmemmap_optimized is
> a good choice.

Thanks for the explanation!
Agree that the flag is a good choice.

How about adding a comment like this above the alloc_huge_page_vmemmap
call in dissolve_free_huge_page?

	/*
	 * Normally update_and_free_page will allocate required vmemmap
	 * before freeing the page. update_and_free_page will fail to
	 * free the page if it can not allocate required vmemmap. We
	 * need to adjust max_huge_pages if the page is not freed.
	 * Attempt to allocate vmemmap here so that we can take
	 * appropriate action on failure.
	 */
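To be concrete, the resulting error handling in dissolve_free_huge_page
could look something like the sketch below. This is untested and only
meant to illustrate the idea; it assumes the alloc_huge_page_vmemmap and
add_hugetlb_page routines from this series, the 'atomic' argument to
update_and_free_page, and the irq-safe hugetlb_lock, and the surrounding
code may of course differ:

	remove_hugetlb_page(h, head, false);
	h->max_huge_pages--;
	spin_unlock_irq(&hugetlb_lock);

	/* (the comment suggested above would go here) */
	rc = alloc_huge_page_vmemmap(h, head);
	if (!rc) {
		/* vmemmap is in place, so this free cannot fail */
		update_and_free_page(h, head, false);
	} else {
		/*
		 * Could not allocate vmemmap. Put the page back on the
		 * hugetlb free list and undo the max_huge_pages
		 * adjustment so the counters stay consistent.
		 */
		spin_lock_irq(&hugetlb_lock);
		add_hugetlb_page(h, head, false);
		h->max_huge_pages++;
		spin_unlock_irq(&hugetlb_lock);
	}

	return rc;

The nice property is that on failure the page ends up back on the free
list with HPG_vmemmap_optimized still set, so a later dissolve or free
can simply retry the allocation.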
...
>>> +static void add_hugetlb_page(struct hstate *h, struct page *page,
>>> +				bool adjust_surplus)
>>> +{
>>
>> We need to be a bit careful with hugepage specific flags that may be
>> set. The routine remove_hugetlb_page which is called for 'page' before
>> this routine will not clear any of the hugepage specific flags. If the
>> calling path goes through free_huge_page, most but not all flags are
>> cleared.
>>
>> We had a discussion about clearing the page->private field in Oscar's
>> series. In the case of 'new' pages we can assume page->private is
>> cleared, but perhaps we should not make that assumption here. Since we
>> hope to rarely call this routine, it might be safer to do something
>> like:
>>
>>	set_page_private(page, 0);
>>	SetHPageVmemmapOptimized(page);
>
> Agree. Thanks for your reminder. I will fix this.
>
>>
>>> +	int nid = page_to_nid(page);
>>> +
>>> +	lockdep_assert_held(&hugetlb_lock);
>>> +
>>> +	INIT_LIST_HEAD(&page->lru);
>>> +	h->nr_huge_pages++;
>>> +	h->nr_huge_pages_node[nid]++;
>>> +
>>> +	if (adjust_surplus) {
>>> +		h->surplus_huge_pages++;
>>> +		h->surplus_huge_pages_node[nid]++;
>>> +	}
>>> +
>>> +	set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
>>> +
>>> +	/*
>>> +	 * The refcount can possibly be increased by memory-failure or
>>> +	 * soft_offline handlers.
>>> +	 */
>>> +	if (likely(put_page_testzero(page))) {
>>
>> In the existing code there is no such test. Is the need for the test
>> because of something introduced in the new code?
>
> No.
>
>> Or, should this test be in the existing code?
>
> Yes. gather_surplus_pages should be fixed. I can fix it
> in a separate patch.
>
> The possible bad scenario:
>
> CPU0:                           CPU1:
>                                 set_compound_page_dtor(HUGETLB_PAGE_DTOR);
> memory_failure_hugetlb
>   get_hwpoison_page
>     __get_hwpoison_page
>       get_page_unless_zero
>                                 put_page_testzero()
>
> put_page(page)
>
> More details and discussion can be found at:
>
> https://lore.kernel.org/linux-doc/CAMZfGtVRSBkKe=tKAKLY8dp_hywotq3xL+EJZNjXuSKt3HK3bQ@xxxxxxxxxxxxxx/

Thank you! I did not remember that discussion. It would be helpful to
add a separate patch for gather_surplus_pages. Otherwise, we have the
VM_BUG_ON there and not in add_hugetlb_page.
--
Mike Kravetz