On Wed, May 19, 2021 at 03:32:17PM -0700, Mike Kravetz wrote: > On 5/18/21 4:12 PM, Naoya Horiguchi wrote: > > From: Naoya Horiguchi <naoya.horiguchi@xxxxxxx> > > > > When hugetlb page fault (under overcommitting situation) and > > memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race: > > > > CPU0: CPU1: > > > > gather_surplus_pages() > > page = alloc_surplus_huge_page() > > memory_failure_hugetlb() > > get_hwpoison_page(page) > > __get_hwpoison_page(page) > > get_page_unless_zero(page) > > zero = put_page_testzero(page) > > VM_BUG_ON_PAGE(!zero, page) > > enqueue_huge_page(h, page) > > put_page(page) > > > > __get_hwpoison_page() only checks the page refcount before taking an > > additional one for memory error handling, which is wrong because there's > > a time window where compound pages have non-zero refcount during > > initialization. So make __get_hwpoison_page() check page status a bit > > more for hugetlb pages. > > > > Fixes: ead07f6a867b ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling") > > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@xxxxxxx> > > Reported-by: Muchun Song <songmuchun@xxxxxxxxxxxxx> > > Cc: stable@xxxxxxxxxxxxxxx # 5.12+ > > --- > > include/linux/hugetlb.h | 6 ++++++ > > mm/hugetlb.c | 15 +++++++++++++++ > > mm/memory-failure.c | 8 +++++++- > > 3 files changed, 28 insertions(+), 1 deletion(-) > > > > diff --git v5.13-rc2/include/linux/hugetlb.h v5.13-rc2_patched/include/linux/hugetlb.h > > index b92f25ccef58..790ae618548d 100644 > > --- v5.13-rc2/include/linux/hugetlb.h > > +++ v5.13-rc2_patched/include/linux/hugetlb.h > > @@ -149,6 +149,7 @@ bool hugetlb_reserve_pages(struct inode *inode, long from, long to, > > long hugetlb_unreserve_pages(struct inode *inode, long start, long end, > > long freed); > > bool isolate_huge_page(struct page *page, struct list_head *list); > > +int get_hwpoison_huge_page(struct page *page, bool *hugetlb); > > void putback_active_hugepage(struct page *page); > > void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason); > > void free_huge_page(struct page *page); > > @@ -339,6 +340,11 @@ static inline bool isolate_huge_page(struct page *page, struct list_head *list) > > return false; > > } > > > > +static inline int get_hwpoison_huge_page(struct page *page, bool *hugetlb) > > +{ > > + return 0; > > +} > > + > > static inline void putback_active_hugepage(struct page *page) > > { > > } > > diff --git v5.13-rc2/mm/hugetlb.c v5.13-rc2_patched/mm/hugetlb.c > > index 95918f410c0f..f138bae3e302 100644 > > --- v5.13-rc2/mm/hugetlb.c > > +++ v5.13-rc2_patched/mm/hugetlb.c > > @@ -5847,6 +5847,21 @@ bool isolate_huge_page(struct page *page, struct list_head *list) > > return ret; > > } > > > > +int get_hwpoison_huge_page(struct page *page, bool *hugetlb) > > +{ > > + int ret = 0; > > + > > + *hugetlb = false; > > + spin_lock_irq(&hugetlb_lock); > > + if (PageHeadHuge(page)) { > > + *hugetlb = true; > > + if (HPageFreed(page) || HPageMigratable(page)) > > + ret = get_page_unless_zero(page); > > + } > > + spin_unlock_irq(&hugetlb_lock); > > + return ret; > > +} > > + > > void putback_active_hugepage(struct page *page) > > { > > spin_lock_irq(&hugetlb_lock); > > diff --git v5.13-rc2/mm/memory-failure.c v5.13-rc2_patched/mm/memory-failure.c > > index 85ad98c00fd9..353c6177e489 100644 > > --- v5.13-rc2/mm/memory-failure.c > > +++ v5.13-rc2_patched/mm/memory-failure.c > > @@ -959,8 +959,14 @@ static int page_action(struct page_state *ps, struct page *p, > > static int __get_hwpoison_page(struct page *page) > > { > > struct page *head = compound_head(page); > > + int ret = 0; > > + bool hugetlb = false; > > + > > + ret = get_hwpoison_huge_page(head, &hugetlb); > > + if (hugetlb) > > + return ret; > > > > Hello Naoya, > > Thanks for your continued efforts. However, I believe the race still > exists. Unless I am mistaken, it is possible that page is in the hugetlb > allocation patch and racing with __get_hwpoison_page() as follows: Hi Mike, thank you for the investigation. > > CPU0: CPU1: > > gather_surplus_pages() > page = alloc_surplus_huge_page() > page = alloc_fresh_huge_page() > page = alloc_buddy_huge_page() > memory_failure_hugetlb() > get_hwpoison_page(page) > __get_hwpoison_page(page) > get_hwpoison_huge_page() > /* Note that PageHuge() > is false, so hugetlb > not set */ > PageTransHuge(head) false It seems that PageTransHuge returns true in this race condition because it simply checks PG_head flag. But anyway, get_page_unless_zero() is called for a hugetlb in this race, which is problematic. > prep_new_huge_page(page) > /* Now PageHuge() becomes true */ > get_page_unless_zero(page) > > I am not sure if it is possible to handle this race in the memory error > code. I think that __get_hwpoison_page() might not properly call get_page_unless_zero(). Looking at other callers of this function, most(*) of them are calling it in the context where a given page is pinned and in-use, and get_page_unless_zero() is used to detect the race with page freeing. In the current version, __get_hwpoison_page() calls get_page_unless_zero() without caring for such an assumption, which might be the root cause of the race with hugetlb page allocation. # (*) It seems to me that do_migrate_range() might have the same issue # around get_page_unless_zero(). So I think of inserting the check to comply with the assumption of get_hwpoison_huge_page() like below: ret = get_hwpoison_huge_page(head, &hugetlb); if (hugetlb) return ret; if (!PageLRU(head) && !__PageMovable(head)) return 0; if (PageTransHuge(head)) { ... } if (get_page_unless_zero(head)) { ... } return 0; The newly added checks should work to prevent the above race, then get_any_page() should retry and grab the page properly as a stable hugetlb page. > I can not think of a way to avoid potentially incrementing the > ref count on a hugetlb page as it is being created. There is nothing > synchronizing this in the hugetlb code. > > When Muchun first proposed a fix to the race, the idea was to catch the > race in the hugetlb code. Michal suggested that the memory error code > be more careful in modifying ref counts. I would wait a bit to see if > someone has a good idea how this can be done. We 'may' need to revisit > the approach suggested by Muchun. If the above approach is still broken, let's revisit Muchun's approach. Thanks, Naoya Horiguchi