Hi Andrew, Jerry found that current version crash with null pointer access, so could you drop this patch from -mm for a while? I'll try to understand the situation, and repost more robust one. Thanks, Naoya Horiguchi On Tue, May 28, 2019 at 09:13:17PM -0700, akpm@xxxxxxxxxxxxxxxxxxxx wrote: > > The patch titled > Subject: mm: hugetlb: soft-offline: fix wrong return value of soft offline > has been added to the -mm tree. Its filename is > mm-hugetlb-soft-offline-fix-wrong-return-value-of-soft-offline.patch > > This patch should soon appear at > http://ozlabs.org/~akpm/mmots/broken-out/mm-hugetlb-soft-offline-fix-wrong-return-value-of-soft-offline.patch > and later at > http://ozlabs.org/~akpm/mmotm/broken-out/mm-hugetlb-soft-offline-fix-wrong-return-value-of-soft-offline.patch > > Before you just go and hit "reply", please: > a) Consider who else should be cc'ed > b) Prefer to cc a suitable mailing list as well > c) Ideally: find the original patch on the mailing list and do a > reply-to-all to that, adding suitable additional cc's > > *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** > > The -mm tree is included into linux-next and is updated > there every 3-4 working days > > ------------------------------------------------------ > From: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx> > Subject: mm: hugetlb: soft-offline: fix wrong return value of soft offline > > Soft offline events for hugetlb pages return -EBUSY when page migration > succeeded and dissolve_free_huge_page() failed, which can happen when > there're surplus hugepages. We should judge pass/fail of soft offline by > checking whether the raw error page was finally contained or not (i.e. > the result of set_hwpoison_free_buddy_page()), so this behavior is wrong. > > This problem was introduced by the following change of commit 6bc9b56433b76 > ("mm: fix race on soft-offlining"): > > if (ret > 0) > ret = -EIO; > } else { > - if (PageHuge(page)) > - dissolve_free_huge_page(page); > + /* > + * We set PG_hwpoison only when the migration source hugepage > + * was successfully dissolved, because otherwise hwpoisoned > + * hugepage remains on free hugepage list, then userspace will > + * find it as SIGBUS by allocation failure. That's not expected > + * in soft-offlining. > + */ > + ret = dissolve_free_huge_page(page); > + if (!ret) { > + if (set_hwpoison_free_buddy_page(page)) > + num_poisoned_pages_inc(); > + } > } > return ret; > } > > so a simple fix is to restore the PageHuge precheck, but my code reading > shows that we already have PageHuge check in dissolve_free_huge_page() > with hugetlb_lock, which is better place to check it. And currently > dissolve_free_huge_page() returns -EBUSY for !PageHuge but that's simply > wrong because that that case should be considered as success (meaning that > "the given hugetlb was already dissolved.") > > This change affects other callers of dissolve_free_huge_page(), which are > also cleaned up by this patch. > > Link: http://lkml.kernel.org/r/1558937200-18544-1-git-send-email-n-horiguchi@xxxxxxxxxxxxx > Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining") > Signed-off-by: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx> > Reported-by: Chen, Jerry T <jerry.t.chen@xxxxxxxxx> > Cc: Michal Hocko <mhocko@xxxxxxxxxx> > Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx> > Cc: "Zhuo, Qiuxu" <qiuxu.zhuo@xxxxxxxxx> > Cc: <stable@xxxxxxxxxxxxxxx> [4.19+] > Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> > --- > > mm/hugetlb.c | 15 +++++++++------ > mm/memory-failure.c | 7 +++---- > 2 files changed, 12 insertions(+), 10 deletions(-) > > --- a/mm/hugetlb.c~mm-hugetlb-soft-offline-fix-wrong-return-value-of-soft-offline > +++ a/mm/hugetlb.c > @@ -1519,7 +1519,12 @@ int dissolve_free_huge_page(struct page > int rc = -EBUSY; > > spin_lock(&hugetlb_lock); > - if (PageHuge(page) && !page_count(page)) { > + if (!PageHuge(page)) { > + rc = 0; > + goto out; > + } > + > + if (!page_count(page)) { > struct page *head = compound_head(page); > struct hstate *h = page_hstate(head); > int nid = page_to_nid(head); > @@ -1564,11 +1569,9 @@ int dissolve_free_huge_pages(unsigned lo > > for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order) { > page = pfn_to_page(pfn); > - if (PageHuge(page) && !page_count(page)) { > - rc = dissolve_free_huge_page(page); > - if (rc) > - break; > - } > + rc = dissolve_free_huge_page(page); > + if (rc) > + break; > } > > return rc; > --- a/mm/memory-failure.c~mm-hugetlb-soft-offline-fix-wrong-return-value-of-soft-offline > +++ a/mm/memory-failure.c > @@ -1733,6 +1733,8 @@ static int soft_offline_huge_page(struct > if (!ret) { > if (set_hwpoison_free_buddy_page(page)) > num_poisoned_pages_inc(); > + else > + ret = -EBUSY; > } > } > return ret; > @@ -1857,11 +1859,8 @@ static int soft_offline_in_use_page(stru > > static int soft_offline_free_page(struct page *page) > { > - int rc = 0; > - struct page *head = compound_head(page); > + int rc = dissolve_free_huge_page(page); > > - if (PageHuge(head)) > - rc = dissolve_free_huge_page(page); > if (!rc) { > if (set_hwpoison_free_buddy_page(page)) > num_poisoned_pages_inc(); > _ > > Patches currently in -mm which might be from n-horiguchi@xxxxxxxxxxxxx are > > mm-hugetlb-soft-offline-fix-wrong-return-value-of-soft-offline.patch > >