On Thu, Jul 16, 2020 at 02:38:06PM +0200, Oscar Salvador wrote:
> This patch changes the way we set and handle in-use poisoned pages.
> Until now, poisoned pages were released to the buddy allocator, trusting
> that the checks that take place prior to delivering the page to its end
> user would act as a safety net and would skip that page.
>
> This has proved to be wrong, as there are pfn walkers out there, like
> compaction, that only care about the page being PageBuddy and sitting in
> a freelist.
> Compaction might not be the only such user, but in any case having
> poisoned pages in the buddy allocator seems a bad idea, as we should only
> have free pages that are ready and meant to be used as such.
>
> Before explaining the approach taken, let us break down the kinds of
> pages we can soft offline.
>
> - Anonymous THP (after the split, they end up being 4K pages)
> - Hugetlb
> - Order-0 pages (that can be either migrated or invalidated)
>
> * Normal pages (order-0 and anon-THP)
>
>   - If they are clean and unmapped page cache pages, we invalidate
>     them by means of invalidate_inode_page().
>   - If they are mapped/dirty, we do the isolate-and-migrate dance.
>
> Either way, we do not call put_page() directly from those paths.
> Instead, we keep the page and send it to page_handle_poison() to perform
> the right handling.
>
> Among other things, page_handle_poison() sets the HWPoison flag and does
> the last put_page().
> This call to put_page() is mainly there to be able to call
> __page_cache_release(), since that function is not exported.
>
> Down the chain, we placed a check for HWPoison pages in
> free_pages_prepare() that simply skips any poisoned page, so those pages
> do not end up either in a pcplist or in the buddy freelists.
>
> After that, we set the refcount on the page to 1 and we increment
> the poisoned pages counter.
>
> We could do as we do for free pages:
> 1) wait until the page hits buddy's freelists
> 2) take it off
> 3) flag it
>
> The problem is that we could race with an allocation, so by the time we
> want to take the page off the buddy, the page may already be allocated,
> and then we cannot soft-offline it.
> This is not fatal, of course, but it is better if we can close the race,
> and doing so does not require a lot of code.
>
> * Hugetlb pages
>
>   - We isolate-and-migrate them
>
> There is no magic in here, we just isolate and migrate them.
> A new set of internal functions has been added to flag a hugetlb page as
> poisoned (SetPageHugePoisoned(), PageHugePoisoned(), ClearPageHugePoisoned()).
> This allows us to flag the page when we migrate it, back in
> move_hugetlb_state().
>
> Later on we check whether the page is poisoned in __free_huge_page(),
> and in that case we bail out before sending the page to e.g. the active
> free list.
> This gives us full control of the page, and we can handle it in
> page_handle_poison().
>
> In other words, we do not allow migrated hugepages to get back to the
> freelists.
>
> Since the page now has no user and has been migrated, we can call
> dissolve_free_huge_page(), which will end up calling update_and_free_page().
> In update_and_free_page() we check whether the page is poisoned.
> If it is, we handle it as we handle gigantic pages, i.e. we break the
> page down into order-0 pages and free them one by one.
> Doing so allows free_pages_prepare() to skip the poisoned ones.
>
> Because of the way we now handle in-use pages, we no longer need the
> put-as-isolation-migratetype dance that guarded against poisoned pages
> ending up in pcplists.
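
As a reading aid, the in-use soft-offline flow described above condenses to
roughly the sketch below. This is only an illustration of the commit message,
not the code in the patch itself (hence the made-up _sketch suffix); the real
changes are in the diff further down.

static void page_handle_poison_sketch(struct page *page)
{
	/*
	 * Flag the page first, so that free_pages_prepare() can recognize
	 * it and bail out instead of handing the page to a pcplist or a
	 * buddy freelist.
	 */
	SetPageHWPoison(page);

	/*
	 * Drop the last reference kept after migration/invalidation; this
	 * is mainly here so that __page_cache_release() runs, since that
	 * function is not exported.
	 */
	put_page(page);

	/* Keep the poisoned page pinned with a refcount of 1 ... */
	page_ref_inc(page);

	/* ... and account it. */
	num_poisoned_pages_inc();
}

The hugetlb case replaces the plain put_page() with a dissolve step, as the
patch below shows.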
I ran Qian Cai's test program (https://github.com/cailca/linux-mm) on a small
(4GB memory) VM, and was surprised to find that (1) the target hugepages are
not always dissolved and (2) dissolved hugepages are still counted in
"HugePages_Total:". See below:

  $ ./random 1
  - start: migrate_huge_offline
  - use NUMA nodes 0,1.
  - mmap and free 8388608 bytes hugepages on node 0
  - mmap and free 8388608 bytes hugepages on node 1
  madvise: Cannot allocate memory

  $ cat /proc/meminfo
  MemTotal:        4026772 kB
  MemFree:          976300 kB
  MemAvailable:     892840 kB
  Buffers:           20936 kB
  Cached:            99768 kB
  SwapCached:         5904 kB
  Active:            84332 kB
  Inactive:         116328 kB
  Active(anon):      27944 kB
  Inactive(anon):    68524 kB
  Active(file):      56388 kB
  Inactive(file):    47804 kB
  Unevictable:        7532 kB
  Mlocked:               0 kB
  SwapTotal:       2621436 kB
  SwapFree:        2609844 kB
  Dirty:                56 kB
  Writeback:             0 kB
  AnonPages:         81764 kB
  Mapped:            54348 kB
  Shmem:              8948 kB
  KReclaimable:      22744 kB
  Slab:              52056 kB
  SReclaimable:      22744 kB
  SUnreclaim:        29312 kB
  KernelStack:        3888 kB
  PageTables:         2804 kB
  NFS_Unstable:          0 kB
  Bounce:                0 kB
  WritebackTmp:          0 kB
  CommitLimit:     3260612 kB
  Committed_AS:     828196 kB
  VmallocTotal:   34359738367 kB
  VmallocUsed:       19260 kB
  VmallocChunk:          0 kB
  Percpu:             5120 kB
  HardwareCorrupted:  5368 kB
  AnonHugePages:     18432 kB
  ShmemHugePages:        0 kB
  ShmemPmdMapped:        0 kB
  FileHugePages:         0 kB
  FilePmdMapped:         0 kB
  CmaTotal:              0 kB
  CmaFree:               0 kB
  HugePages_Total:    1342    // still counted as hugetlb pages
  HugePages_Free:        0    // all hugepages are still allocated (or leaked?)
  HugePages_Rsvd:        0
  HugePages_Surp:      762    // some are counted as surplus
  Hugepagesize:       2048 kB
  Hugetlb:         2748416 kB
  DirectMap4k:      112480 kB
  DirectMap2M:     4081664 kB

  $ page-types -b hwpoison
               flags      page-count       MB  symbolic-flags                               long-symbolic-flags
  0x0000000000080008             421        1  ___U_______________X_______________________  uptodate,hwpoison
  0x00000000000a8018               1        0  ___UD__________H_G_X_______________________  uptodate,dirty,compound_head,huge,hwpoison
  0x00000000000a801c             920        3  __RUD__________H_G_X_______________________  referenced,uptodate,dirty,compound_head,huge,hwpoison
               total            1342        5

This means that some hugepages are dissolved but others are not, which is
probably not desirable. I'll dig into this more later, but let me share the
result first.

A few minor comments below ...
> 
> Signed-off-by: Oscar Salvador <osalvador@xxxxxxxx>
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>
> ---
>  include/linux/page-flags.h |  5 ----
>  mm/hugetlb.c               | 60 +++++++++++++++++++++++++++++++++-----
>  mm/memory-failure.c        | 53 +++++++++++++--------------------
>  mm/migrate.c               | 11 ++-----
>  mm/page_alloc.c            | 38 +++++++-----------------
>  5 files changed, 86 insertions(+), 81 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 01baf6d426ff..2ac8bfa0cf20 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -426,13 +426,8 @@ PAGEFLAG(HWPoison, hwpoison, PF_ANY)
>  TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
>  #define __PG_HWPOISON (1UL << PG_hwpoison)
>  extern bool take_page_off_buddy(struct page *page);
> -extern bool set_hwpoison_free_buddy_page(struct page *page);
>  #else
>  PAGEFLAG_FALSE(HWPoison)
> -static inline bool set_hwpoison_free_buddy_page(struct page *page)
> -{
> -	return 0;
> -}
>  #define __PG_HWPOISON 0
>  #endif
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 7badb01d15e3..1c6397936512 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -29,6 +29,7 @@
>  #include <linux/numa.h>
>  #include <linux/llist.h>
>  #include <linux/cma.h>
> +#include <linux/migrate.h>
> 
>  #include <asm/page.h>
>  #include <asm/pgalloc.h>
> @@ -1209,9 +1210,26 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
>  		((node = hstate_next_node_to_free(hs, mask)) || 1);	\
>  		nr_nodes--)
> 
> -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
> -static void destroy_compound_gigantic_page(struct page *page,
> -					unsigned int order)
> +static inline bool PageHugePoisoned(struct page *page)
> +{
> +	if (!PageHuge(page))
> +		return false;
> +
> +	return (unsigned long)page[3].mapping == -1U;
> +}
> +
> +static inline void SetPageHugePoisoned(struct page *page)
> +{
> +	page[3].mapping = (void *)-1U;
> +}
> +
> +static inline void ClearPageHugePoisoned(struct page *page)
> +{
> +	page[3].mapping = NULL;
> +}
> +
> +static void destroy_compound_gigantic_page(struct hstate *h, struct page *page,
> +					unsigned int order)
>  {
>  	int i;
>  	int nr_pages = 1 << order;
> @@ -1222,14 +1240,19 @@ static void destroy_compound_gigantic_page(struct page *page,
>  	atomic_set(compound_pincount_ptr(page), 0);
> 
>  	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
> +		if (!hstate_is_gigantic(h))
> +			p->mapping = NULL;
>  		clear_compound_head(p);
>  		set_page_refcounted(p);
>  	}
> 
> +	if (PageHugePoisoned(page))
> +		ClearPageHugePoisoned(page);
>  	set_compound_order(page, 0);
>  	__ClearPageHead(page);
>  }
> 
> +#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
>  static void free_gigantic_page(struct page *page, unsigned int order)
>  {
>  	/*
> @@ -1284,13 +1307,16 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
>  	return NULL;
>  }
>  static inline void free_gigantic_page(struct page *page, unsigned int order) { }
> -static inline void destroy_compound_gigantic_page(struct page *page,
> -						unsigned int order) { }
> +static inline void destroy_compound_gigantic_page(struct hstate *h,
> +						struct page *page,
> +						unsigned int order) { }
>  #endif
> 
>  static void update_and_free_page(struct hstate *h, struct page *page)
>  {
>  	int i;
> +	bool poisoned = PageHugePoisoned(page);
> +	unsigned int order = huge_page_order(h);
> 
>  	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
>  		return;
> @@ -1313,11 +1339,21 @@ static void update_and_free_page(struct hstate *h, struct page *page)
>  		 * we might block in free_gigantic_page().
>  		 */
>  		spin_unlock(&hugetlb_lock);
> -		destroy_compound_gigantic_page(page, huge_page_order(h));
> -		free_gigantic_page(page, huge_page_order(h));
> +		destroy_compound_gigantic_page(h, page, order);
> +		free_gigantic_page(page, order);
>  		spin_lock(&hugetlb_lock);
>  	} else {
> -		__free_pages(page, huge_page_order(h));
> +		if (unlikely(poisoned)) {
> +			/*
> +			 * If the hugepage is poisoned, do as we do for
> +			 * gigantic pages and free the pages as order-0.
> +			 * free_pages_prepare will skip over the poisoned ones.
> +			 */
> +			destroy_compound_gigantic_page(h, page, order);

Judging from its name, this function is for gigantic pages, so it shouldn't be
called for non-gigantic huge pages. Maybe renaming it and/or introducing an
inner function layer to factor out the common part would be better (a rough
sketch of what I mean is at the end of this mail).

> +			free_contig_range(page_to_pfn(page), 1 << order);
> +		} else {
> +			__free_pages(page, huge_page_order(h));
> +		}
>  	}
>  }
> 
> @@ -1427,6 +1463,11 @@ static void __free_huge_page(struct page *page)
>  	if (restore_reserve)
>  		h->resv_huge_pages++;
> 
> +	if (PageHugePoisoned(page)) {
> +		spin_unlock(&hugetlb_lock);
> +		return;
> +	}
> +
>  	if (PageHugeTemporary(page)) {
>  		list_del(&page->lru);
>  		ClearPageHugeTemporary(page);
> @@ -5642,6 +5683,9 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
>  	hugetlb_cgroup_migrate(oldpage, newpage);
>  	set_page_owner_migrate_reason(newpage, reason);
> 
> +	if (reason == MR_MEMORY_FAILURE)
> +		SetPageHugePoisoned(oldpage);
> +
>  	/*
>  	 * transfer temporary state of the new huge page. This is
>  	 * reverse to other transitions because the newpage is going to
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index caf012d34607..c0ebab4eed4c 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -65,9 +65,17 @@ int sysctl_memory_failure_recovery __read_mostly = 1;
> 
>  atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
> 
> -static void page_handle_poison(struct page *page)
> +static void page_handle_poison(struct page *page, bool release, bool set_flag,
> +				bool huge_flag)
>  {
> -	SetPageHWPoison(page);
> +	if (set_flag)
> +		SetPageHWPoison(page);
> +
> +	if (huge_flag)
> +		dissolve_free_huge_page(page);
> +	else if (release)
> +		put_page(page);
> +

Indentation seems to be broken here; you can run checkpatch.pl to find the
details.

Thanks,
Naoya Horiguchi

>  	page_ref_inc(page);
>  	num_poisoned_pages_inc();
>  }
> @@ -1717,7 +1725,7 @@ static int get_any_page(struct page *page, unsigned long pfn)
> 
>  static int soft_offline_huge_page(struct page *page)
>  {
> -	int ret;
> +	int ret = -EBUSY;
>  	unsigned long pfn = page_to_pfn(page);
>  	struct page *hpage = compound_head(page);
>  	LIST_HEAD(pagelist);
> @@ -1757,19 +1765,12 @@ static int soft_offline_huge_page(struct page *page)
>  		ret = -EIO;
>  	} else {
>  		/*
> -		 * We set PG_hwpoison only when the migration source hugepage
> -		 * was successfully dissolved, because otherwise hwpoisoned
> -		 * hugepage remains on free hugepage list, then userspace will
> -		 * find it as SIGBUS by allocation failure. That's not expected
> -		 * in soft-offlining.
> +		 * At this point the page cannot be in-use since we do not
> +		 * let the page to go back to hugetlb freelists.
> +		 * In that case we just need to dissolve it.
> +		 * page_handle_poison will take care of it.
>  		 */
> -		ret = dissolve_free_huge_page(page);
> -		if (!ret) {
> -			if (set_hwpoison_free_buddy_page(page))
> -				num_poisoned_pages_inc();
> -			else
> -				ret = -EBUSY;
> -		}
> +		page_handle_poison(page, true, true, true);
>  	}
>  	return ret;
>  }
> @@ -1804,10 +1805,8 @@ static int __soft_offline_page(struct page *page)
>  	 * would need to fix isolation locking first.
>  	 */
>  	if (ret == 1) {
> -		put_page(page);
>  		pr_info("soft_offline: %#lx: invalidated\n", pfn);
> -		SetPageHWPoison(page);
> -		num_poisoned_pages_inc();
> +		page_handle_poison(page, true, true, false);
>  		return 0;
>  	}
> 
> @@ -1838,7 +1837,9 @@ static int __soft_offline_page(struct page *page)
>  	list_add(&page->lru, &pagelist);
>  	ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
>  				MIGRATE_SYNC, MR_MEMORY_FAILURE);
> -	if (ret) {
> +	if (!ret) {
> +		page_handle_poison(page, true, true, false);
> +	} else {
>  		if (!list_empty(&pagelist))
>  			putback_movable_pages(&pagelist);
> 
> @@ -1857,37 +1858,25 @@ static int __soft_offline_page(struct page *page)
>  static int soft_offline_in_use_page(struct page *page)
>  {
>  	int ret;
> -	int mt;
>  	struct page *hpage = compound_head(page);
> 
>  	if (!PageHuge(page) && PageTransHuge(hpage))
>  		if (try_to_split_thp_page(page, "soft offline") < 0)
>  			return -EBUSY;
> 
> -	/*
> -	 * Setting MIGRATE_ISOLATE here ensures that the page will be linked
> -	 * to free list immediately (not via pcplist) when released after
> -	 * successful page migration. Otherwise we can't guarantee that the
> -	 * page is really free after put_page() returns, so
> -	 * set_hwpoison_free_buddy_page() highly likely fails.
> -	 */
> -	mt = get_pageblock_migratetype(page);
> -	set_pageblock_migratetype(page, MIGRATE_ISOLATE);
>  	if (PageHuge(page))
>  		ret = soft_offline_huge_page(page);
>  	else
>  		ret = __soft_offline_page(page);
> -	set_pageblock_migratetype(page, mt);
>  	return ret;
>  }
> 
>  static int soft_offline_free_page(struct page *page)
>  {
>  	int rc = -EBUSY;
> -	int rc = dissolve_free_huge_page(page);
> 
>  	if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) {
> -		page_handle_poison(page);
> +		page_handle_poison(page, false, true, false);
>  		rc = 0;
>  	}
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 75c10d81e833..a68d81d0ae6e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1222,16 +1222,11 @@ static int unmap_and_move(new_page_t get_new_page,
>  	 * we want to retry.
>  	 */
>  	if (rc == MIGRATEPAGE_SUCCESS) {
> -		put_page(page);
> -		if (reason == MR_MEMORY_FAILURE) {
> +		if (reason != MR_MEMORY_FAILURE)
>  			/*
> -			 * Set PG_HWPoison on just freed page
> -			 * intentionally. Although it's rather weird,
> -			 * it's how HWPoison flag works at the moment.
> +			 * We handle poisoned pages in page_handle_poison.
>  			 */
> -			if (set_hwpoison_free_buddy_page(page))
> -				num_poisoned_pages_inc();
> -		}
> +			put_page(page);
>  	} else {
>  		if (rc != -EAGAIN) {
>  			if (likely(!__PageMovable(page))) {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4fa0e0887c07..11df51fc2718 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1175,6 +1175,16 @@ static __always_inline bool free_pages_prepare(struct page *page,
> 
>  	trace_mm_page_free(page, order);
> 
> +	if (unlikely(PageHWPoison(page)) && !order) {
> +		/*
> +		 * Untie memcg state and reset page's owner
> +		 */
> +		if (memcg_kmem_enabled() && PageKmemcg(page))
> +			__memcg_kmem_uncharge_page(page, order);
> +		reset_page_owner(page, order);
> +		return false;
> +	}
> +
>  	/*
>  	 * Check tail pages before head page information is cleared to
>  	 * avoid checking PageCompound for order-0 pages.
> @@ -8844,32 +8854,4 @@ bool take_page_off_buddy(struct page *page)
>  	spin_unlock_irqrestore(&zone->lock, flags);
>  	return ret;
>  }
> -
> -/*
> - * Set PG_hwpoison flag if a given page is confirmed to be a free page. This
> - * test is performed under the zone lock to prevent a race against page
> - * allocation.
> - */
> -bool set_hwpoison_free_buddy_page(struct page *page)
> -{
> -	struct zone *zone = page_zone(page);
> -	unsigned long pfn = page_to_pfn(page);
> -	unsigned long flags;
> -	unsigned int order;
> -	bool hwpoisoned = false;
> -
> -	spin_lock_irqsave(&zone->lock, flags);
> -	for (order = 0; order < MAX_ORDER; order++) {
> -		struct page *page_head = page - (pfn & ((1 << order) - 1));
> -
> -		if (PageBuddy(page_head) && page_order(page_head) >= order) {
> -			if (!TestSetPageHWPoison(page))
> -				hwpoisoned = true;
> -			break;
> -		}
> -	}
> -	spin_unlock_irqrestore(&zone->lock, flags);
> -
> -	return hwpoisoned;
> -}
>  #endif
> -- 
> 2.26.2
> 
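
To make the earlier comment about destroy_compound_gigantic_page() a bit more
concrete, here is a rough, untested sketch of the kind of split I have in
mind. All names below are made up; the point is only to separate the shared
tail from a gigantic-specific entry point, so the "gigantic" name stays honest:

/* Hypothetical common helper shared by the gigantic and the poisoned cases. */
static void __destroy_compound_hugetlb_page(struct hstate *h, struct page *page,
					    unsigned int order)
{
	int i;
	int nr_pages = 1 << order;
	struct page *p = page + 1;

	atomic_set(compound_mapcount_ptr(page), 0);
	atomic_set(compound_pincount_ptr(page), 0);

	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
		/* Same body as in the patched destroy_compound_gigantic_page(). */
		if (!hstate_is_gigantic(h))
			p->mapping = NULL;
		clear_compound_head(p);
		set_page_refcounted(p);
	}

	if (PageHugePoisoned(page))
		ClearPageHugePoisoned(page);
	set_compound_order(page, 0);
	__ClearPageHead(page);
}

/* Keeps its current name and is used only for gigantic pages ... */
static void destroy_compound_gigantic_page(struct hstate *h, struct page *page,
					   unsigned int order)
{
	__destroy_compound_hugetlb_page(h, page, order);
}

/* ... while the poisoned, non-gigantic path gets its own entry point. */
static void destroy_compound_poisoned_page(struct hstate *h, struct page *page,
					   unsigned int order)
{
	__destroy_compound_hugetlb_page(h, page, order);
}

With something like this, update_and_free_page() could call
destroy_compound_poisoned_page() in the poisoned non-gigantic branch instead
of reusing the gigantic-page helper.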