On Fri 26-02-21 09:35:10, Michal Hocko wrote: > On Mon 22-02-21 14:51:36, Oscar Salvador wrote: > > alloc_contig_range will fail if it ever sees a HugeTLB page within the > > range we are trying to allocate, even when that page is free and can be > > easily reallocated. > > This has proved to be problematic for some users of alloc_contic_range, > > e.g: CMA and virtio-mem, where those would fail the call even when those > > pages lay in ZONE_MOVABLE and are free. > > > > We can do better by trying to replace such page. > > > > Free hugepages are tricky to handle so as to no userspace application > > notices disruption, we need to replace the current free hugepage with > > a new one. > > > > In order to do that, a new function called alloc_and_dissolve_huge_page > > is introduced. > > This function will first try to get a new fresh hugepage, and if it > > succeeds, it will replace the old one in the free hugepage pool. > > > > All operations are being handled under hugetlb_lock, so no races are > > possible. The only exception is when page's refcount is 0, but it still > > has not been flagged as PageHugeFreed. > > I think it would be helpful to call out that specific case explicitly > here. I can see only one scenario (are there more?) > __free_huge_page() isolate_or_dissolve_huge_page > PageHuge() == T > alloc_and_dissolve_huge_page > alloc_fresh_huge_page() > spin_lock(hugetlb_lock) > // PageHuge() && !PageHugeFreed && > // !PageCount() > spin_unlock(hugetlb_lock) > spin_lock(hugetlb_lock) > 1) update_and_free_page > PageHuge() == F > __free_pages() > 2) enqueue_huge_page > SetPageHugeFreed() > spin_unlock(&hugetlb_lock) > > > In this case we retry as the window race is quite small and we have high > > chances to succeed next time. > > > > With regard to the allocation, we restrict it to the node the page belongs > > to with __GFP_THISNODE, meaning we do not fallback on other node's zones. > > > > Note that gigantic hugetlb pages are fenced off since there is a cyclic > > dependency between them and alloc_contig_range. > > > > Signed-off-by: Oscar Salvador <osalvador@xxxxxxx> > > Thanks this looks much better than the initial version. One nit below. > Acked-by: Michal Hocko <mhocko@xxxxxxxx> Btw. if David has some numbers it would be great to add them to the changelog. -- Michal Hocko SUSE Labs