On 4/13/21 3:47 AM, Oscar Salvador wrote:
> alloc_contig_range will fail if it ever sees a HugeTLB page within the
> range we are trying to allocate, even when that page is free and can be
> easily reallocated.
> This has proved to be problematic for some users of alloc_contig_range,
> e.g: CMA and virtio-mem, where those would fail the call even when those
> pages lay in ZONE_MOVABLE and are free.
>
> We can do better by trying to replace such page.
>
> Free hugepages are tricky to handle: so that no userspace application
> notices disruption, we need to replace the current free hugepage with
> a new one.
>
> In order to do that, a new function called alloc_and_dissolve_huge_page
> is introduced.
> This function will first try to get a new fresh hugepage, and if it
> succeeds, it will replace the old one in the free hugepage pool.
>
> The free page replacement is done under hugetlb_lock, so no external
> users of hugetlb will notice the change.
> To allocate the new huge page, we use alloc_buddy_huge_page(), so we
> do not have to deal with any counters, and prep_new_huge_page() is not
> called. This is valuable because in case we need to free the new page,
> we only need to call __free_pages().
>
> Once we know that the page to be replaced is a genuine 0-refcounted
> huge page, we remove the old page from the freelist by
> remove_hugetlb_page().
> Then, we can call __prep_new_huge_page() and
> __prep_account_new_huge_page() for the new huge page to properly
> initialize it and increment the hstate->nr_huge_pages counter
> (previously decremented by remove_hugetlb_page()).
> Once done, the page is enqueued by enqueue_huge_page() and it is ready
> to be used.
>
> There is one tricky case, when the page's refcount is 0 because it is
> in the process of being released.
> A missing PageHugeFreed bit will tell us that freeing is in flight, so
> we retry after dropping the hugetlb_lock. The race window should be
> small and the next retry should make forward progress.
>
> E.g:
>
> CPU0                              CPU1
> free_huge_page()                  isolate_or_dissolve_huge_page
>                                     PageHuge() == T
>                                     alloc_and_dissolve_huge_page
>                                       alloc_buddy_huge_page()
>                                       spin_lock_irq(hugetlb_lock)
>                                       // PageHuge() && !PageHugeFreed &&
>                                       // !PageCount()
>                                       spin_unlock_irq(hugetlb_lock)
>   spin_lock_irq(hugetlb_lock)
>   1) update_and_free_page
>        PageHuge() == F
>        __free_pages()
>   2) enqueue_huge_page
>        SetPageHugeFreed()
>   spin_unlock(&hugetlb_lock)

Very small nit, the above should be spin_unlock_irq(&hugetlb_lock)

>                                     spin_lock_irq(hugetlb_lock)
>                                     1) PageHuge() == F (freed by case#1 from CPU0)
>                                     2) PageHuge() == T
>                                        PageHugeFreed() == T
>                                        - proceed with replacing the page
>
> In the case above we retry, as the race window is quite small and we
> have high chances to succeed next time.
>
> With regard to the allocation, we restrict it to the node the page
> belongs to with __GFP_THISNODE, meaning we do not fall back on other
> nodes' zones.
>
> Note that gigantic hugetlb pages are fenced off, since there is a cyclic
> dependency between them and alloc_contig_range.
>
> Signed-off-by: Oscar Salvador <osalvador@xxxxxxx>
> ---

...
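One question on intended usage before getting to the diff: IIUC, the idea
is that the pfn walker in the alloc_contig_range() path calls the new
helper whenever it runs into a hugetlb page, along these lines. This is
just my sketch; the actual call site is not part of this patch, so the
surrounding code here is illustrative only:

	/*
	 * Illustrative fragment, not from this patch: expected
	 * caller-side use of the new helper when a range scan
	 * hits a hugetlb page.
	 */
	if (PageHuge(page)) {
		ret = isolate_or_dissolve_huge_page(page);

		/*
		 * -ENOMEM: no replacement hugepage could be allocated.
		 * -EBUSY:  the hugepage is in use; caller may retry or fail.
		 */
		if (ret)
			return ret;
	}
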
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 0607b2b71ac6..4a664d6e82c1 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2266,6 +2266,121 @@ static void restore_reserve_on_error(struct hstate *h,
>  	}
>  }
>  
> +/*
> + * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old one
> + * @h: struct hstate old page belongs to
> + * @old_page: Old page to dissolve
> + * Returns 0 on success, otherwise negated error.
> + */
> +
> +static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
> +{
> +	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
> +	int nid = page_to_nid(old_page);
> +	struct page *new_page;
> +	int ret = 0;
> +
> +	/*
> +	 * Before dissolving the page, we need to allocate a new one for the
> +	 * pool to remain stable. Using alloc_buddy_huge_page() allows us to
> +	 * avoid dealing with prep_new_huge_page() and any counters. This
> +	 * simplifies things and lets us do the whole thing under the lock.
> +	 */
> +	new_page = alloc_buddy_huge_page(h, gfp_mask, nid, NULL, NULL);
> +	if (!new_page)
> +		return -ENOMEM;
> +
> +retry:
> +	spin_lock_irq(&hugetlb_lock);
> +	if (!PageHuge(old_page)) {
> +		/*
> +		 * Freed from under us. Drop new_page too.
> +		 */
> +		goto free_new;
> +	} else if (page_count(old_page)) {
> +		/*
> +		 * Someone has grabbed the page, fail for now.
> +		 */
> +		ret = -EBUSY;
> +		goto free_new;
> +	} else if (!HPageFreed(old_page)) {
> +		/*
> +		 * Page's refcount is 0 but it has not been enqueued in the
> +		 * freelist yet. Race window is small, so we can succeed here
> +		 * if we retry.
> +		 */
> +		spin_unlock_irq(&hugetlb_lock);
> +		cond_resched();
> +		goto retry;
> +	} else {
> +		/*
> +		 * Ok, old_page is still a genuine free hugepage. Remove it from
> +		 * the freelist and decrease the counters. These will be
> +		 * incremented again when calling __prep_account_new_huge_page()
> +		 * and enqueue_huge_page() for new_page. The counters will remain
> +		 * stable since this happens under the lock.
> +		 */
> +		remove_hugetlb_page(h, old_page, false);
> +
> +		/*
> +		 * Call __prep_new_huge_page() to construct the hugetlb page, and
> +		 * then enqueue it to place it in the freelists. After this,
> +		 * counters are back on track. Free hugepages have a refcount of 0,
> +		 * so we need to decrease new_page's count as well.
> +		 */
> +		__prep_new_huge_page(new_page);
> +		__prep_account_new_huge_page(h, nid);
> +		page_ref_dec(new_page);
> +		enqueue_huge_page(h, new_page);
> +
> +		/*
> +		 * Pages have been replaced, we can safely free the old one.
> +		 */
> +		spin_unlock_irq(&hugetlb_lock);
> +		update_and_free_page(h, old_page);
> +	}
> +
> +	return ret;
> +
> +free_new:
> +	spin_unlock_irq(&hugetlb_lock);
> +	__free_pages(new_page, huge_page_order(h));
> +
> +	return ret;
> +}
> +
> +int isolate_or_dissolve_huge_page(struct page *page)
> +{
> +	struct hstate *h;
> +	struct page *head;
> +
> +	/*
> +	 * The page might have been dissolved from under our feet, so make sure
> +	 * to carefully check the state under the lock.
> +	 * Return success when racing as if we dissolved the page ourselves.
> +	 */
> +	spin_lock_irq(&hugetlb_lock);
> +	if (PageHuge(page)) {
> +		head = compound_head(page);
> +		h = page_hstate(head);
> +	} else {
> +		spin_unlock(&hugetlb_lock);

Should be spin_unlock_irq(&hugetlb_lock);

Other than that, it looks good.
--
Mike Kravetz
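
P.S. to expand on the two unlock nits: as I understand the locking rework
this series is built on, hugetlb_lock can now be taken from IRQ context
(e.g. free_huge_page() via put_page()), so every process-context
lock/unlock of it must be the matched _irq pair. A plain spin_unlock()
after spin_lock_irq() returns with local interrupts still disabled:

	spin_lock_irq(&hugetlb_lock);	/* disables local interrupts */
	/* ... critical section ... */
	spin_unlock(&hugetlb_lock);	/* bug: interrupts stay disabled */

instead of the correct pairing:

	spin_lock_irq(&hugetlb_lock);
	/* ... critical section ... */
	spin_unlock_irq(&hugetlb_lock);	/* re-enables local interrupts */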