On Mon, Jul 11, 2022 at 11:26:34AM +0800, Miaohe Lin wrote:
> On 2022/7/8 13:36, Naoya Horiguchi wrote:
> > From: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>
> > 
> > When handling a memory error on a hugetlb page, the error handler tries to
> > dissolve it and turn it into 4kB pages.  If it's successfully dissolved,
> > the PageHWPoison flag is moved to the raw error page, so that's all right.
> > However, dissolve sometimes fails, and then the error page is left as a
> > hwpoisoned hugepage.  It's useful if we can retry dissolving it to save
> > healthy pages, but that's not possible now because the information about
> > where the raw error pages are is lost.
> > 
> > Use the private field of a few tail pages to keep that information.  The
> > code path of shrinking the hugepage pool uses this info to try a delayed
> > dissolve.  In order to remember multiple errors in a hugepage, a
> > singly-linked list originating from the SUBPAGE_INDEX_HWPOISON-th tail
> > page is constructed.  Only simple operations (adding an entry or clearing
> > all) are required and the list is assumed not to be very long, so this
> > simple data structure should be enough.
> > 
> > If we fail to save raw error info, the hwpoison hugepage has errors on an
> > unknown subpage and this new saving mechanism no longer works, so disable
> > both saving new raw error info and freeing hwpoison hugepages.
> > 
> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>
> > Reported-by: kernel test robot <lkp@xxxxxxxxx>
> > ---
> > v4 -> v5:
> > - fixed build error (reported by kernel test robot),
> > - do not try to undo remove_hugetlb_page() when HPageRawHwpUnreliable is true,
> > - check HPageRawHwpUnreliable() before hugetlb_vmemmap_restore(),
> > - call num_poisoned_pages_inc() in hugetlb_set_page_hwpoison() when kmalloc
> >   succeeds,
> > - remove "inline" in the definition of hugetlb_clear_page_hwpoison().
> > 
> > v3 -> v4:
> > - resolve conflict with "mm: hugetlb_vmemmap: improve hugetlb_vmemmap
> >   code readability", use hugetlb_vmemmap_restore() instead of
> >   hugetlb_vmemmap_alloc().
> > 
> > v2 -> v3:
> > - remove duplicate "return ret" lines,
> > - use GFP_ATOMIC instead of GFP_KERNEL,
> > - introduce HPageRawHwpUnreliable pseudo flag (suggested by Muchun),
> > - hugetlb_clear_page_hwpoison() removes the raw_hwp_page list even if
> >   HPageRawHwpUnreliable is true (by Miaohe).
> > 
> > v1 -> v2:
> > - support hwpoison hugepages with multiple errors,
> > - moved the new interface functions to mm/memory-failure.c,
> > - define additional subpage index SUBPAGE_INDEX_HWPOISON_UNRELIABLE,
> > - stop freeing/dissolving hwpoison hugepages with unreliable raw error info,
> > - drop hugetlb_clear_page_hwpoison() in dissolve_free_huge_page() because
> >   that's done in update_and_free_page(),
> > - move setting/clearing the PG_hwpoison flag to the new interfaces,
> > - check already hwpoisoned or not on a subpage basis.
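[ Aside for readers of the archive: the mm/memory-failure.c hunk that
actually maintains this list is not quoted in this reply.  As a minimal
sketch of the idea described above -- not the exact patch code; the node
and helper names here are illustrative -- each raw error page is recorded
in a small node chained off the private field of the
SUBPAGE_INDEX_HWPOISON-th tail page:

/* illustrative sketch only, not the patch itself */
struct raw_hwp_page {
	struct llist_node node;
	struct page *page;	/* the raw 4kB error page */
};

static inline struct llist_head *raw_hwp_list_head(struct page *hpage)
{
	/* reuse page->private of the SUBPAGE_INDEX_HWPOISON-th tail page */
	return (struct llist_head *)&page_private(hpage + SUBPAGE_INDEX_HWPOISON);
}

static int record_raw_hwp_page(struct page *hpage, struct page *page)
{
	struct raw_hwp_page *raw_hwp = kmalloc(sizeof(*raw_hwp), GFP_ATOMIC);

	if (!raw_hwp) {
		/* we can no longer tell which subpages are broken */
		SetHPageRawHwpUnreliable(hpage);
		return -ENOMEM;
	}
	raw_hwp->page = page;
	llist_add(&raw_hwp->node, raw_hwp_list_head(hpage));
	return 0;
}

Only "add an entry" and "clear all" are needed, so an llist-style
singly-linked list keeps the common case to one allocation and one pointer
update per error. ]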
> > 
> > ChangeLog since previous post on 4/27:
> > - fixed typo in patch description (by Miaohe)
> > - fixed config value in #ifdef statement (by Miaohe)
> > - added sentences about the "multiple hwpoison pages" scenario in patch
> >   description
> > ---
> >  include/linux/hugetlb.h | 18 +++++++++-
> >  mm/hugetlb.c            | 32 ++++++++++++-----
> >  mm/memory-failure.c     | 79 +++++++++++++++++++++++++++++++++++++++--
> >  3 files changed, 116 insertions(+), 13 deletions(-)
> > 
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 6d0620edf0a6..6fd128b80d57 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -42,6 +42,9 @@ enum {
> >  	SUBPAGE_INDEX_CGROUP,		/* reuse page->private */
> >  	SUBPAGE_INDEX_CGROUP_RSVD,	/* reuse page->private */
> >  	__MAX_CGROUP_SUBPAGE_INDEX = SUBPAGE_INDEX_CGROUP_RSVD,
> > +#endif
> > +#ifdef CONFIG_MEMORY_FAILURE
> > +	SUBPAGE_INDEX_HWPOISON,
> >  #endif
> >  	__NR_USED_SUBPAGE,
> >  };
> > @@ -551,7 +554,7 @@ generic_hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
> >   *	Synchronization: Initially set after new page allocation with no
> >   *	locking.  When examined and modified during migration processing
> >   *	(isolate, migrate, putback) the hugetlb_lock is held.
> > - * HPG_temporary - - Set on a page that is temporarily allocated from the buddy
> > + * HPG_temporary - Set on a page that is temporarily allocated from the buddy
> >   *	allocator.  Typically used for migration target pages when no pages
> >   *	are available in the pool.  The hugetlb free page path will
> >   *	immediately free pages with this flag set to the buddy allocator.
> > @@ -561,6 +564,8 @@ generic_hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
> >   * HPG_freed - Set when page is on the free lists.
> >   *	Synchronization: hugetlb_lock held for examination and modification.
> >   * HPG_vmemmap_optimized - Set when the vmemmap pages of the page are freed.
> > + * HPG_raw_hwp_unreliable - Set when the hugetlb page has a hwpoison sub-page
> > + *	that is not tracked by raw_hwp_page list.
> >   */
> >  enum hugetlb_page_flags {
> >  	HPG_restore_reserve = 0,
> > @@ -568,6 +573,7 @@ enum hugetlb_page_flags {
> >  	HPG_temporary,
> >  	HPG_freed,
> >  	HPG_vmemmap_optimized,
> > +	HPG_raw_hwp_unreliable,
> >  	__NR_HPAGEFLAGS,
> >  };
> > 
> > @@ -614,6 +620,7 @@ HPAGEFLAG(Migratable, migratable)
> >  HPAGEFLAG(Temporary, temporary)
> >  HPAGEFLAG(Freed, freed)
> >  HPAGEFLAG(VmemmapOptimized, vmemmap_optimized)
> > +HPAGEFLAG(RawHwpUnreliable, raw_hwp_unreliable)
> > 
> >  #ifdef CONFIG_HUGETLB_PAGE
> > 
> > @@ -796,6 +803,15 @@ extern int dissolve_free_huge_page(struct page *page);
> >  extern int dissolve_free_huge_pages(unsigned long start_pfn,
> >  					unsigned long end_pfn);
> > 
> > +#ifdef CONFIG_MEMORY_FAILURE
> > +extern int hugetlb_clear_page_hwpoison(struct page *hpage);
> > +#else
> > +static inline int hugetlb_clear_page_hwpoison(struct page *hpage)
> > +{
> > +	return 0;
> > +}
> > +#endif
> > +
> >  #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
> >  #ifndef arch_hugetlb_migration_supported
> >  static inline bool arch_hugetlb_migration_supported(struct hstate *h)
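[ For reviewers unfamiliar with these wrappers: the HPAGEFLAG() line added
above generates the usual test/set/clear helpers on the head page's private
field.  Roughly (a simplified expansion from memory of
include/linux/hugetlb.h, not quoted from the patch):

static inline int HPageRawHwpUnreliable(struct page *page)
	{ return test_bit(HPG_raw_hwp_unreliable, &page->private); }
static inline void SetHPageRawHwpUnreliable(struct page *page)
	{ set_bit(HPG_raw_hwp_unreliable, &page->private); }
static inline void ClearHPageRawHwpUnreliable(struct page *page)
	{ clear_bit(HPG_raw_hwp_unreliable, &page->private); }

These are the HPageRawHwpUnreliable()/SetHPageRawHwpUnreliable() calls used
in the mm/hugetlb.c and mm/memory-failure.c hunks. ]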
> >   */
> >  enum hugetlb_page_flags {
> >  	HPG_restore_reserve = 0,
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 77119d93a0f9..3956494cc5fb 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1442,6 +1442,15 @@ static void __remove_hugetlb_page(struct hstate *h, struct page *page,
> >  		h->surplus_huge_pages_node[nid]--;
> >  	}
> > 
> > +	/*
> > +	 * This leaves HPageRawHwpUnreliable pages as leaked hugepages, not
> > +	 * as leaked generic-compound pages.  Otherwise page_mapped() or
> > +	 * folio_mapped() gets slow because the for-loop over each subpage
> > +	 * is called.
> > +	 */
> > +	if (HPageRawHwpUnreliable(page))
> > +		return;
> > +
> 
> This patch looks good to me, apart from a few possible problems below.
> 
> Should "nr_huge_pages" and "nr_huge_pages_node" be adjusted too?  If this is
> called from dissolve_free_huge_page() and hugetlb_vmemmap_restore() fails,
> add_hugetlb_page() will be called:
> 
> add_hugetlb_page:
> 	...
> 	h->nr_huge_pages++;
> 	h->nr_huge_pages_node[nid]++;
> 	^^^^^^^^ 1. "nr_huge_pages" and "nr_huge_pages_node" might be incorrect?
> 	...
> 	zeroed = put_page_testzero(page);
> 	^^^^^^^^ 2. VM_BUG_ON_PAGE(page_ref_count(page) == 0, page); will be triggered?
> 
> Or am I missing something?

No, you're not missing anything: this code breaks the specific case you
describe, so I'd like to simply drop this "if".  HPageRawHwpUnreliable
hugepages should be very rare, and calling page_mapped() on such a leaked
page should be rarer still, so the impact of the slowdown should be minimal.

Thanks,
Naoya Horiguchi
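P.S. To spell out the breakage for the archive: the new early return sits
above the tail of __remove_hugetlb_page(), which (roughly, in this tree;
simplified from memory) does:

	/* skipped by the early return above */
	set_page_refcounted(page);	/* give the removed page refcount 1 */
	...
	h->nr_huge_pages--;
	h->nr_huge_pages_node[nid]--;

So when dissolve_free_huge_page() later fails in hugetlb_vmemmap_restore()
and undoes the removal, add_hugetlb_page() increments counters that were
never decremented, and its put_page_testzero() runs on a page whose refcount
was never raised, tripping the VM_BUG_ON_PAGE() that Miaohe points out.
Dropping the "if" avoids both problems.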