On Wed, Apr 21, 2021 at 1:48 AM Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
>
> On 4/20/21 1:46 AM, Muchun Song wrote:
> > On Tue, Apr 20, 2021 at 7:20 AM Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
> >>
> >> On 4/15/21 1:40 AM, Muchun Song wrote:
> >>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> >>> index 0abed7e766b8..6e970a7d3480 100644
> >>> --- a/include/linux/hugetlb.h
> >>> +++ b/include/linux/hugetlb.h
> >>> @@ -525,6 +525,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
> >>>   * code knows it has only reference. All other examinations and
> >>>   * modifications require hugetlb_lock.
> >>>   * HPG_freed - Set when page is on the free lists.
> >>> + * HPG_vmemmap_optimized - Set when the vmemmap pages of the page are freed.
> >>>   * Synchronization: hugetlb_lock held for examination and modification.
> >>
> >> I like the per-page flag. In previous versions of the series, you just
> >> checked free_vmemmap_pages_per_hpage() to determine if vmemmap
> >> should be allocated. Is there any change in functionality that makes it
> >> necessary to set the flag in each page, or is it mostly for flexibility
> >> going forward?
> >
> > Actually, only the routine of dissolving the page cares whether
> > the page is on the buddy free list when update_and_free_page
> > returns. But we cannot change the return type of
> > update_and_free_page (e.g. change the return type from 'void' to
> > 'int'). Why? If the hugepage is freed through a kworker, we cannot
> > know the return value when update_and_free_page returns.
> > So adding a return value seems odd.
> >
> > In the dissolving routine, we can allocate the vmemmap pages first.
> > If that succeeds, we can be sure that update_and_free_page will
> > successfully free the page. So I need some way to mark the page
> > as not needing its vmemmap pages allocated.
> >
> > On the surface, we seem to have a straightforward way
> > to do this.
> >
> > Add a new parameter 'alloc_vmemmap' to update_and_free_page() to
> > indicate that the caller has already allocated the vmemmap pages, so
> > update_and_free_page() does not need to allocate them. Just like below.
> >
> > void update_and_free_page(struct hstate *h, struct page *page, bool atomic,
> >                           bool alloc_vmemmap)
> > {
> >         if (alloc_vmemmap)
> >                 /* allocate vmemmap pages */
> > }
> >
> > But if the page is freed through a kworker, how do we pass
> > 'alloc_vmemmap' to the kworker? We can embed this
> > information into the per-page flag, so once we introduce
> > HPG_vmemmap_optimized, the alloc_vmemmap parameter is
> > no longer necessary.
> >
> > So it seems that introducing HPG_vmemmap_optimized is
> > a good choice.
>
> Thanks for the explanation!
>
> Agree that the flag is a good choice. How about adding a comment like
> this above the alloc_huge_page_vmemmap call in dissolve_free_huge_page?
>
> /*
>  * Normally update_and_free_page will allocate required vmemmap before
>  * freeing the page. update_and_free_page will fail to free the page
>  * if it cannot allocate required vmemmap. We need to adjust
>  * max_huge_pages if the page is not freed. Attempt to allocate
>  * vmemmap here so that we can take appropriate action on failure.
>  */

Thanks. I will add this comment.

>
> ...
> >>> +static void add_hugetlb_page(struct hstate *h, struct page *page,
> >>> +				bool adjust_surplus)
> >>> +{
> >>
> >> We need to be a bit careful with hugepage specific flags that may be
> >> set. The routine remove_hugetlb_page, which is called for 'page' before
> >> this routine, will not clear any of the hugepage specific flags. If the
> >> calling path goes through free_huge_page, most but not all flags are
> >> cleared.
> >>
> >> We had a discussion about clearing the page->private field in Oscar's
> >> series. In the case of 'new' pages we can assume page->private is
> >> cleared, but perhaps we should not make that assumption here. Since we
> >> hope to rarely call this routine, it might be safer to do something
> >> like:
> >>
> >> 	set_page_private(page, 0);
> >> 	SetHPageVmemmapOptimized(page);
> >
> > Agree. Thanks for your reminder. I will fix this.
> >
> >>
> >>> +	int nid = page_to_nid(page);
> >>> +
> >>> +	lockdep_assert_held(&hugetlb_lock);
> >>> +
> >>> +	INIT_LIST_HEAD(&page->lru);
> >>> +	h->nr_huge_pages++;
> >>> +	h->nr_huge_pages_node[nid]++;
> >>> +
> >>> +	if (adjust_surplus) {
> >>> +		h->surplus_huge_pages++;
> >>> +		h->surplus_huge_pages_node[nid]++;
> >>> +	}
> >>> +
> >>> +	set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
> >>> +
> >>> +	/*
> >>> +	 * The refcount can possibly be increased by memory-failure or
> >>> +	 * soft_offline handlers.
> >>> +	 */
> >>> +	if (likely(put_page_testzero(page))) {
> >>
> >> In the existing code there is no such test. Is the need for the test
> >> because of something introduced in the new code?
> >
> > No.
> >
> >> Or, should this test be in the existing code?
> >
> > Yes. gather_surplus_pages should be fixed. I can fix it
> > in a separate patch.
> >
> > The possible bad scenario:
> >
> > CPU0:                           CPU1:
> >                                 set_compound_page_dtor(HUGETLB_PAGE_DTOR);
> > memory_failure_hugetlb
> >   get_hwpoison_page
> >     __get_hwpoison_page
> >       get_page_unless_zero
> >                                 put_page_testzero()
> >
> > put_page(page)
> >
> > More details and discussion can be found at:
> >
> > https://lore.kernel.org/linux-doc/CAMZfGtVRSBkKe=tKAKLY8dp_hywotq3xL+EJZNjXuSKt3HK3bQ@xxxxxxxxxxxxxx/
>
> Thank you! I did not remember that discussion.
>
> It would be helpful to add a separate patch for gather_surplus_pages.
> Otherwise, we have the VM_BUG_ON there and not in add_hugetlb_page.
>

Agree. Will do.

> --
> Mike Kravetz
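To make the gather_surplus_pages fix concrete, here is a rough,
untested sketch of what I have in mind. It simply mirrors the
put_page_testzero() handling from add_hugetlb_page above, so the
separate patch may end up looking different:

	/* Free the needed pages to the hugetlb pool */
	list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
		if ((--needed) < 0)
			break;
		/*
		 * The refcount can possibly be increased by memory-failure
		 * or soft_offline handlers, so only enqueue the page when
		 * we drop the last reference. Otherwise the temporary
		 * reference holder will free the page via free_huge_page()
		 * (the HUGETLB_PAGE_DTOR is already set) when it calls
		 * put_page().
		 */
		if (likely(put_page_testzero(page)))
			enqueue_huge_page(h, page);
	}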