On Tue, Apr 26, 2022 at 07:41:52PM +0800, Miaohe Lin wrote:
> On 2022/4/25 21:55, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Fri, Apr 22, 2022 at 02:53:48PM +0800, Miaohe Lin wrote:
> >> On 2022/4/22 6:04, Mike Kravetz wrote:
> >>> On 4/21/22 07:20, David Hildenbrand wrote:
> >>>> On 21.04.22 15:51, Miaohe Lin wrote:
> >>>>> When trying to offline pages, a HWPoisoned hugepage is migrated without
> >>>>> checking PageHWPoison first, so corrupted data could be consumed. Fix
> >>>>> it by deferring isolate_huge_page() until PageHWPoison is handled.
> >>>>>
> >>>>
> >>>> CCing Oscar, Mike and Naoya
> >>>>
> >>>>> Signed-off-by: Miaohe Lin <linmiaohe@xxxxxxxxxx>
> >>>>> ---
> >>>>>  mm/memory_hotplug.c | 11 +++++++----
> >>>>>  1 file changed, 7 insertions(+), 4 deletions(-)
> >>>>>
> >>>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> >>>>> index 4c6065e5d274..093f85ec5c5c 100644
> >>>>> --- a/mm/memory_hotplug.c
> >>>>> +++ b/mm/memory_hotplug.c
> >>>>> @@ -1600,11 +1600,9 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> >>>>>  		folio = page_folio(page);
> >>>>>  		head = &folio->page;
> >>>>>
> >>>>> -		if (PageHuge(page)) {
> >>>>> +		if (PageHuge(page))
> >>>>>  			pfn = page_to_pfn(head) + compound_nr(head) - 1;
> >>>>> -			isolate_huge_page(head, &source);
> >>>>> -			continue;
> >>>>> -		} else if (PageTransHuge(page))
> >>>>> +		else if (PageTransHuge(page))
> >>>>>  			pfn = page_to_pfn(head) + thp_nr_pages(page) - 1;
> >>>>>
> >>>>>  		/*
> >>>>> @@ -1622,6 +1620,11 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> >>>>>  			continue;
> >>>>>  		}
> >>>>>
> >>>>> +		if (PageHuge(page)) {
> >>>>> +			isolate_huge_page(head, &source);
> >>>>> +			continue;
> >>>>> +		}
> >>>>> +
> >>>>>  		if (!get_page_unless_zero(page))
> >>>>>  			continue;
> >>>>>  		/*
> >>>>
> >>>> The problem statement makes sense to me, but I am not sure about the
> >>>> details if we run into the "PageHWPoison" path with a huge page. I have
> >>>> the gut feeling that we have to do more for huge pages in the
> >>>> PageHWPoison() path, because we might be dealing with a free huge page
> >>>> after unmap succeeds. I might be wrong.
> >>>>
> >>>
> >>> Thinking about memory errors always makes my head hurt :)
> >>
> >> Me too. ;)
> >>
> >>>
> >>> What 'state' could a poisoned hugetlb page be in here?
> >>> - Mapped into a process address space?
> >>> - On the hugetlb free lists?
> >
> > There seem to be at least two cases where a hwpoisoned hugetlb page remains undissolved:
> > (1) a hwpoisoned hugepage in the free hugepage list,
> > (2) a hwpoisoned hugepage belonging to a hugetlbfs file.
> >
> >>>
> >>> IIUC, when poisoning a hugetlb page we try to dissolve those that are
> >>> free and unmap those which are mapped. So, this means those operations
> >>> must have failed for some reason. Is that correct?
> >
> > Yes, (1) results from a failure in error handling, but (2) can occur even
> > when a memory error on an in-use hugepage is successfully handled.
>
> Could you please explain (2) in more detail? IIUC, an in-use hugepage belonging to a
> hugetlbfs file will be removed from the pagecache via hugetlbfs_error_remove_page().
> So when successfully handled, hugepage->mapping will be NULL and thus the page no
> longer belongs to a hugetlbfs file. Or am I missing something?

No, you're right. Sorry for my misunderstanding about who takes the last
refcount in case (2).

> >
> >>
> >> I think you're right.
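(To keep the flow being discussed in one place: with the above patch applied,
the per-pfn loop body in do_migrate_range() would read roughly like the sketch
below. It is simplified, with the surrounding locking and the non-hugetlb
isolation path omitted, and combines the patch hunks with the existing
PageHWPoison block quoted later in this thread.)

	if (PageHuge(page))
		pfn = page_to_pfn(head) + compound_nr(head) - 1;
	else if (PageTransHuge(page))
		pfn = page_to_pfn(head) + thp_nr_pages(page) - 1;

	if (PageHWPoison(page)) {
		if (WARN_ON(folio_test_lru(folio)))
			folio_isolate_lru(folio);
		if (folio_mapped(folio))
			try_to_unmap(folio, TTU_IGNORE_MLOCK);
		continue;	/* never hand a poisoned page to migration */
	}

	/* Only reached when the hugetlb page is not hwpoisoned. */
	if (PageHuge(page)) {
		isolate_huge_page(head, &source);
		continue;
	}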
> >>
> >> BTW: Now that PageHWPoison is set under hugetlb_lock and HPageFreed is also checked
> >> under that lock, it should be more likely that we handle a hugetlb page on the free
> >> lists successfully, because the possible race with hugetlb page allocation or demotion
> >> can be prevented by the PageHWPoison flag. So it is more likely to be a mapped hugetlb
> >> page here. But that doesn't matter.
> >>
> >>>
> >>> Now, IF the above is true this implies there is a poisoned page somewhere
> >>> within the hugetlb page. But, poison is only marked in the head page.
> >>> So, we do not really know where within the page the actual error exists.
> >>> Is my reasoning still correct?
> >>>
> >>
> >> IIRC, Naoya once proposed to store the poisoned page info in the subpage field.
> >
> > Yes, it's important to keep the offset of the raw page. I have a "storing raw
> > error info" patch for this, but it's still too immature to submit.
> >
> >>
> >>> If my reasoning is correct, then I am not sure what we can do here.
> >>> The code to handle poison is:
> >>>
> >>> 	if (PageHWPoison(page)) {
> >>> 		if (WARN_ON(folio_test_lru(folio)))
> >>> 			folio_isolate_lru(folio);
> >>> 		if (folio_mapped(folio))
> >>> 			try_to_unmap(folio, TTU_IGNORE_MLOCK);
> >>> 		continue;
> >>> 	}
> >>>
> >>> My first observation is that if a hugetlb page is passed to try_to_unmap
> >>> as above we may BUG. This is because we need to hold i_mmap_rwsem in
> >>> write mode because of the possibility of calling huge_pmd_unshare. :(
> >>
> >> I'm sorry. I missed that case. :(
> >
> > OK, so the above check should not be called for hugetlb pages.
> >
> >>
> >>>
> >>> I 'think' try_to_unmap could succeed on a poisoned hugetlb page once we
> >>> add correct locking. So, the page would be unmapped. I do not think anyone
> >>> is holding a ref, so the last unmap should put the hugetlb page on the
> >>> free list. Correct? We will not migrate the page, but ...
> >
> > That seems to depend on whether the hugepage is anonymous or file-backed.
> > If we consider an anonymous hugepage, the above statement is correct.
> >
> > As for what we can do, for (1) we can simply skip the hwpoisoned hugepage
> > in do_migrate_range() (i.e. no need to call isolate_huge_page()), because the
> > outer while-loop in offline_pages() calls dissolve_free_huge_pages(), so the
> > free hugepage can be handled there. The above "storing raw error info" patch
> > will also enable dissolve_free_huge_pages() to handle a hwpoisoned free
> > hugepage (by moving the PG_hwpoison flag to the raw error page), so I assume
> > that patch is in place.
>
> Yes, I think (1) will be fixed this way.
>
> >
> > As for (2), I think that making such hugepages unmovable (by clearing
> > HPageMigratable) could work. With that change, scan_movable_pages() detects
> > unmovable pages, so do_migrate_range() is not called for the hugepages.
> > Moreover, scan_movable_pages() should return -EBUSY if an unmovable page is
> > found, so offline_pages() fails without trying to touch the hwpoisoned hugepage.
>
> Yes, it seems to work, but this will make memory offline fail. Could this be overkill?

I think so. And I found (in writing/testing patches) that releasing the
last refcount in dissolve_free_huge_pages() in offline_pages() could work,
so offline_pages() does not have to fail. So that's better than what I
wrote above.

>
> Anyway the reported problem (trying to migrate a hwpoisoned hugepage) would be solved.
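(For concreteness, the HPageMigratable idea for (2) sketched above would
amount to something like the following in the memory-failure path. This is
a hypothetical sketch only, not an actual patch; it assumes hugetlb_lock
still serializes the per-hugepage flags as in the current hugetlb code.)

	/*
	 * Hypothetical: mark a hwpoisoned in-use hugepage as unmovable so
	 * that scan_movable_pages() stops reporting it as movable and
	 * do_migrate_range() is never called for it.
	 */
	spin_lock_irq(&hugetlb_lock);
	if (PageHuge(page) && HPageMigratable(head))
		ClearHPageMigratable(head);
	spin_unlock_irq(&hugetlb_lock);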
> >
> > # Maybe we might optionally try to free the hwpoisoned hugepage by removing
> > # the affected hugetlbfs file, but it's not clear to me that that's acceptable
> > # (a file is gone when doing memory hotremove?).
> >
> >>
> >> Might the hugetlb page still be in the pagecache? But yes, the last unmap could
> >> put the hugetlb page on the free list.
> >>
> >>>
> >>> After the call to do_migrate_range() in offline_pages(), we will call
> >>> dissolve_free_huge_pages(). For each hugetlb page, dissolve_free_huge_pages()
> >>> will call dissolve_free_huge_page(), likely passing in the 'head' page.
> >>> When dissolve_free_huge_page() is called for poisoned hugetlb pages from
> >>> the memory error handling code, it passes in the subpage which contains
> >>> the memory error. Before freeing the hugetlb page to buddy, there is this
> >>> code:
> >>>
> >>> 	/*
> >>> 	 * Move PageHWPoison flag from head page to the raw
> >>> 	 * error page, which makes any subpages rather than
> >>> 	 * the error page reusable.
> >>> 	 */
> >>> 	if (PageHWPoison(head) && page != head) {
> >>> 		SetPageHWPoison(page);
> >>> 		ClearPageHWPoison(head);
> >>> 	}
> >>> 	update_and_free_page(h, head, false)
> >>>
> >>> As previously mentioned, outside the memory error handling code we do
> >>> not know what page within the hugetlb page contains poison. So, this
> >>> will likely result in a page with errors on the free list and an OK
> >>> page marked with error.
> >>
> >> Yes, this is possible. It's really a pity. Could we just refuse to dissolve the
> >> hugetlb page and keep the hugetlb page around? This would waste some healthy
> >> subpages but would do the right thing?
> >
> > Could you let me try writing patches in my approach?
> > I'll try to submit the fix with the "storing raw error info" patch.
>
> Sure, I will wait for that. Many thanks for your hard work! :)

I have almost prepared this, so will send it this afternoon.

> >
> >>
> >>>
> >>> In other places, we 'bail' if we encounter a hugetlb page with poison.
> >>> It would be unfortunate to prevent memory offline if the range contains
> >>> a hugetlb page with poison. After all, offlining a section with poison
> >>> makes sense.
> >>>
> >>
> >> IIUC, a (hugetlb) page with poison could be offlined anytime, even if it's still
> >> in the pagecache or swapcache, i.e. still referenced from somewhere, because memory
> >> offline just offlines the poisoned page while ignoring the page refcount entirely.
> >> I can't figure out an easy way to fix it yet. :(
> >
> > Yes, that's ultimately the expected behavior, but unlike the normal pagecache we
> > need to pay extra attention to hugetlbfs files because they are not storage-backed.
> >
> > Thank you for valuable comments, everyone.
>
> BTW: There is another possible problem with hwpoison. Consider the scenario below:
>
> offline_pages
>   do {
>     scan_movable_pages
>       __PageMovable			/* Page is also hwpoisoned. */
>         goto found;			/* So ret is always 0. */
>     do_migrate_range
>       if (PageHWPoison(page))		/* PageMovable hwpoisoned page is not handled. */
>         continue;
>   } while (!ret);			/* We will meet that page again and again. */
>
> So we might busy-loop until a signal is pending. Or am I missing something? I think
> this could be fixed by deferring setting the hwpoison flag until the page refcount is
> successfully incremented for a normal page in memory_failure(), which is already on
> your to-do list. So could this possible issue be left alone for now?

Thank you for finding this. Maybe I saw a similar issue (an infinite loop)
while testing patches, but I'm not sure about the details.
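(If the deferred-flag idea goes in the direction Miaohe describes, the
ordering in memory_failure() would change roughly as follows. This is an
illustrative sketch only: error paths are simplified and the reordering is
hypothetical, not what the code does today.)

	/*
	 * Sketch: pin the page *before* marking it, so a page that cannot
	 * be pinned is never left PageHWPoison for offline_pages() to
	 * spin on.  get_hwpoison_page() is the existing refcount helper
	 * in mm/memory-failure.c; only the ordering here is new, and the
	 * handling of its free-page return value is glossed over.
	 */
	if (get_hwpoison_page(p, flags) <= 0)
		return -EBUSY;		/* bail without setting the flag */
	SetPageHWPoison(p);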
I'll look into it more.

Thanks,
Naoya Horiguchi