On Wed, Jul 06, 2022 at 10:37:50AM +0800, Miaohe Lin wrote:
> On 2022/7/4 9:33, Naoya Horiguchi wrote:
> > From: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>
> > 
> > When handling a memory error on a hugetlb page, the error handler tries to
> > dissolve it and turn it into 4kB pages. If it's successfully dissolved, the
> > PageHWPoison flag is moved to the raw error page, so that's all right.
> > However, dissolve sometimes fails, and then the error page is left as a
> > hwpoisoned hugepage. It's useful if we can retry to dissolve it to save
> > healthy pages, but that's not possible now because the information about
> > where the raw error pages are is lost.
> > 
> > Use the private field of a few tail pages to keep that information. The
> > code path of shrinking the hugepage pool uses this info to try delayed dissolve.
> > In order to remember multiple errors in a hugepage, a singly-linked list
> > originating from the SUBPAGE_INDEX_HWPOISON-th tail page is constructed. Only
> > simple operations (adding an entry or clearing all) are required and the
> > list is assumed not to be very long, so this simple data structure should
> > be enough.
> > 
> > If we fail to save raw error info, the hwpoison hugepage has errors on an
> > unknown subpage, and this new saving mechanism does not work any more,
> > so disable saving new raw error info and freeing hwpoison hugepages.
> > 
> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>
> > ---
...
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 66bb39e0fce8..ccd470f0194c 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1535,17 +1535,15 @@ static void __update_and_free_page(struct hstate *h, struct page *page)
> >  	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
> >  		return;
> >  
> > -	if (hugetlb_vmemmap_restore(h, page)) {
> > -		spin_lock_irq(&hugetlb_lock);
> > -		/*
> > -		 * If we cannot allocate vmemmap pages, just refuse to free the
> > -		 * page and put the page back on the hugetlb free list and treat
> > -		 * as a surplus page.
> > -		 */
> > -		add_hugetlb_page(h, page, true);
> > -		spin_unlock_irq(&hugetlb_lock);
> > -		return;
> > -	}
> > +	if (hugetlb_vmemmap_restore(h, page))
> > +		goto fail;
> > +
> > +	/*
> > +	 * Move PageHWPoison flag from head page to the raw error pages,
> > +	 * which makes any healthy subpages reusable.
> > +	 */
> > +	if (unlikely(PageHWPoison(page) && hugetlb_clear_page_hwpoison(page)))
> > +		goto fail;
> 
> IIUC, HPageVmemmapOptimized must have been cleared via hugetlb_vmemmap_restore above. So
> VM_BUG_ON_PAGE(!HPageVmemmapOptimized(page), page) in add_hugetlb_page will be triggered
> if we go to fail here. add_hugetlb_page is expected to be called when we cannot allocate
> vmemmap pages.

Thanks a lot, you're right. I shouldn't simply factor the failure path out
with the goto label. I think that it's hard to undo hugetlb_vmemmap_restore(),
so my plan is to check HPageRawHwpUnreliable() before hugetlb_vmemmap_restore(),
then call hugetlb_clear_page_hwpoison() after it (where the tail struct pages
are available).
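Concretely, I have something like the following in mind (completely untested,
just to show the ordering; whether the HPageRawHwpUnreliable() case should
simply keep the hugepage like this or put it back on the free list as a
surplus page still needs some thought):

static void __update_and_free_page(struct hstate *h, struct page *page)
{
	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
		return;

	/*
	 * If we don't know which raw subpages are hwpoisoned, we can't
	 * dissolve this hugepage safely, so give up before touching
	 * vmemmap.  (XXX: or put it back on the free list instead?)
	 */
	if (HPageRawHwpUnreliable(page))
		return;

	if (hugetlb_vmemmap_restore(h, page)) {
		spin_lock_irq(&hugetlb_lock);
		/*
		 * If we cannot allocate vmemmap pages, just refuse to free the
		 * page and put the page back on the hugetlb free list and
		 * treat as a surplus page.
		 */
		add_hugetlb_page(h, page, true);
		spin_unlock_irq(&hugetlb_lock);
		return;
	}

	/*
	 * Move PageHWPoison flag from head page to the raw error pages,
	 * which makes any healthy subpages reusable.  The tail struct pages
	 * are accessible here because vmemmap has just been restored.
	 */
	if (unlikely(PageHWPoison(page)))
		hugetlb_clear_page_hwpoison(page);

	/* ... the rest of the function is unchanged ... */
}

With this, add_hugetlb_page() is reachable only when hugetlb_vmemmap_restore()
fails, i.e. while HPageVmemmapOptimized is still set, so the VM_BUG_ON you
mention is no longer a problem, and the fail label goes away.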
> > 
> >  	for (i = 0; i < pages_per_huge_page(h);
> >  	     i++, subpage = mem_map_next(subpage, page, i)) {
> > @@ -1566,6 +1564,16 @@ static void __update_and_free_page(struct hstate *h, struct page *page)
> >  	} else {
> >  		__free_pages(page, huge_page_order(h));
> >  	}
> > +	return;
> > +fail:
> > +	spin_lock_irq(&hugetlb_lock);
> > +	/*
> > +	 * If we cannot allocate vmemmap pages or cannot identify raw hwpoison
> > +	 * subpages reliably, just refuse to free the page and put the page
> > +	 * back on the hugetlb free list and treat as a surplus page.
> > +	 */
> > +	add_hugetlb_page(h, page, true);
> > +	spin_unlock_irq(&hugetlb_lock);
> >  }
> >  
> >  /*
> > @@ -2109,15 +2117,6 @@ int dissolve_free_huge_page(struct page *page)
> >  	 */
> >  	rc = hugetlb_vmemmap_restore(h, head);
> >  	if (!rc) {
> > -		/*
> > -		 * Move PageHWPoison flag from head page to the raw
> > -		 * error page, which makes any subpages rather than
> > -		 * the error page reusable.
> > -		 */
> > -		if (PageHWPoison(head) && page != head) {
> > -			SetPageHWPoison(page);
> > -			ClearPageHWPoison(head);
> > -		}
> >  		update_and_free_page(h, head, false);
> >  	} else {
> >  		spin_lock_irq(&hugetlb_lock);
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index c9931c676335..53bf7486a245 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -1664,6 +1664,82 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> >  EXPORT_SYMBOL_GPL(mf_dax_kill_procs);
> >  #endif /* CONFIG_FS_DAX */
> >  
> > +/*
> > + * Struct raw_hwp_page represents information about "raw error page",
> > + * constructing singly linked list originated from ->private field of
> > + * SUBPAGE_INDEX_HWPOISON-th tail page.
> > + */
> > +struct raw_hwp_page {
> > +	struct llist_node node;
> > +	struct page *page;
> > +};
> > +
> > +static inline struct llist_head *raw_hwp_list_head(struct page *hpage)
> > +{
> > +	return (struct llist_head *)&page_private(hpage + SUBPAGE_INDEX_HWPOISON);
> > +}
> > +
> > +static inline int hugetlb_set_page_hwpoison(struct page *hpage,
> > +					struct page *page)
> > +{
> > +	struct llist_head *head;
> > +	struct raw_hwp_page *raw_hwp;
> > +	struct llist_node *t, *tnode;
> > +	int ret;
> > +
> > +	/*
> > +	 * Once the hwpoison hugepage has lost reliable raw error info,
> > +	 * there is little meaning to keep additional error info precisely,
> > +	 * so skip to add additional raw error info.
> > +	 */
> > +	if (HPageRawHwpUnreliable(hpage))
> > +		return -EHWPOISON;
> 
> If we return here, num_poisoned_pages can't reflect all the hwpoisoned hugepages?

No, it can't now. Currently we try (and fail) to count only the hwpoison
subpages that have raw_hwp_info. If we want to track all corrupted pages
(including hwpoison subpages without raw_hwp_info), maybe running the
following part first in this function could make it better.

	ret = TestSetPageHWPoison(hpage) ? -EHWPOISON : 0;
	/* the first error event will be counted in action_result(). */
	if (ret)
		num_poisoned_pages_inc();

But I like the option you suggest below.

> 
> > +	head = raw_hwp_list_head(hpage);
> > +	llist_for_each_safe(tnode, t, head->first) {
> > +		struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);
> > +
> > +		if (p->page == page)
> > +			return -EHWPOISON;
> > +	}
> > +
> > +	ret = TestSetPageHWPoison(hpage) ? -EHWPOISON : 0;
> > +	/* the first error event will be counted in action_result(). */
> > +	if (ret)
> > +		num_poisoned_pages_inc();
> > +
> > +	raw_hwp = kmalloc(sizeof(struct raw_hwp_page), GFP_ATOMIC);
> > +	if (raw_hwp) {
> > +		raw_hwp->page = page;
> > +		llist_add(&raw_hwp->node, head);
> 
> IMHO, we might need to do num_poisoned_pages_inc here because we decrement the
> num_poisoned_pages according to the llist length.

Yes, if we'd like to count only hwpoisoned subpages with raw_hwp_info, doing
num_poisoned_pages_inc here is fine.
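Just to make sure we mean the same thing, something like this (untested),
with the num_poisoned_pages_inc() after TestSetPageHWPoison() above dropped:

	raw_hwp = kmalloc(sizeof(struct raw_hwp_page), GFP_ATOMIC);
	if (raw_hwp) {
		raw_hwp->page = page;
		llist_add(&raw_hwp->node, head);
		/* the first error event will be counted in action_result(). */
		if (ret)
			num_poisoned_pages_inc();
	} else {
		/* ... SetHPageRawHwpUnreliable() part unchanged ... */
	}
	return ret;

Then num_poisoned_pages stays consistent with the length of the raw_hwp list
that the next patch uses when decrementing.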
> 
> > +	} else {
> > +		/*
> > +		 * Failed to save raw error info. We no longer trace all
> > +		 * hwpoisoned subpages, and we need refuse to free/dissolve
> > +		 * this hwpoisoned hugepage.
> > +		 */
> > +		SetHPageRawHwpUnreliable(hpage);
> > +	}
> > +	return ret;
> > +}
> > +
> > +inline int hugetlb_clear_page_hwpoison(struct page *hpage)
> 
> off-the-topic: Is "inline" needed here? I see hugetlb_clear_page_hwpoison is "extern" above.

Maybe not; this code is not performance-sensitive, and this "inline" is
actually a leftover from updates in the previous versions. I'll remove it.

> 
> > +{
> > +	struct llist_head *head;
> > +	struct llist_node *t, *tnode;
> > +
> > +	if (!HPageRawHwpUnreliable(hpage))
> > +		ClearPageHWPoison(hpage);
> > +	head = raw_hwp_list_head(hpage);
> > +	llist_for_each_safe(tnode, t, head->first) {
> > +		struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);
> > +
> > +		SetPageHWPoison(p->page);
> 
> IMHO, in HPageRawHwpUnreliable(hpage) case, it's better not to do SetPageHWPoison here.
> Because hugepage won't be dissolved and thus we cannot write any data to some tail struct
> pages if HugeTLB Vmemmap Optimization is enabled. Freeing the memory here should be enough.

This is a good point, too. The current version surely does not work with HVO,
so I'm thinking of simply giving up clearing hwpoison when HPageVmemmapOptimized
is true. And I should leave an inline comment about this.

> > +		kfree(p);
> > +	}
> > +	llist_del_all(head);
> > +	return 0;
> > +}
> 
> I thought num_poisoned_pages_dec is missing and return value is unneeded. But this is changed
> in next patch. So it should be fine here.

OK.

> 
> > +
> >  /*
> >   * Called from hugetlb code with hugetlb_lock held.
> >   *
> > @@ -1698,7 +1774,7 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
> >  		goto out;
> >  	}
> >  
> > -	if (TestSetPageHWPoison(head)) {
> > +	if (hugetlb_set_page_hwpoison(head, page)) {
> >  		ret = -EHWPOISON;
> >  		goto out;
> >  	}
> > @@ -1751,7 +1827,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
> >  	lock_page(head);
> >  
> >  	if (hwpoison_filter(p)) {
> > -		ClearPageHWPoison(head);
> > +		hugetlb_clear_page_hwpoison(head);
> >  		res = -EOPNOTSUPP;
> >  		goto out;
> >  	}
> > 
> 
> Many thanks for your hard work. :)

Thanks for the detailed review and feedback.

- Naoya Horiguchi