Re: [PATCH v4 12/15] mm,hwpoison: Rework soft offline for in-use pages

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Jul 16, 2020 at 02:38:06PM +0200, Oscar Salvador wrote:
> This patch changes the way we set and handle in-use poisoned pages.
> Until now, poisoned pages were released to the buddy allocator, trusting
> that the checks that take place prior to deliver the page to its end
> user would act as a safe net and would skip that page.
> 
> This has proved to be wrong, as we got some pfn walkers out there, like
> compaction, that all they care is the page to be PageBuddy and be in a
> freelist.
> Although this might not be the only user, having poisoned pages
> in the buddy allocator seems a bad idea as we should only have
> free pages that are ready and meant to be used as such.
> 
> Before explaining the taken approach, let us break down the kind
> of pages we can soft offline.
> 
> - Anonymous THP (after the split, they end up being 4K pages)
> - Hugetlb
> - Order-0 pages (that can be either migrated or invalited)
> 
> * Normal pages (order-0 and anon-THP)
> 
>   - If they are clean and unmapped page cache pages, we invalidate
>     then by means of invalidate_inode_page().
>   - If they are mapped/dirty, we do the isolate-and-migrate dance.
> 
>   Either way, do not call put_page directly from those paths.
>   Instead, we keep the page and send it to page_set_poison to perform the
>   right handling.
> 
>   Among other things, page_set_poison() sets the HWPoison flag and does the last
>   put_page.
>   This call to put_page is mainly to be able to call __page_cache_release,
>   since this function is not exported.
> 
>   Down the chain, we placed a check for HWPoison page in
>   free_pages_prepare, that just skips any poisoned page, so those pages
>   do not end up either in a pcplist or in buddy-freelist.
> 
>   After that, we set the refcount on the page to 1 and we increment
>   the poisoned pages counter.
> 
>   We could do as we do for free pages:
>   1) wait until the page hits buddy's freelists
>   2) take it off
>   3) flag it
> 
>   The problem is that we could race with an allocation, so by the time we
>   want to take the page off the buddy, the page is already allocated, so we
>   cannot soft-offline it.
>   This is not fatal of course, but if it is better if we can close the race
>   as does not require a lot of code.
> 
> * Hugetlb pages
> 
>   - We isolate-and-migrate them
> 
>   There is no magic in here, we just isolate and migrate them.
>   A new set of internal functions have been made to flag a hugetlb page as
>   poisoned (SetPageHugePoisoned(), PageHugePoisoned(), ClearPageHugePoisoned())
>   This allows us to flag the page when we migrate it, back in
>   move_hugetlb_state().
>
>   Later on we check whether the page is poisoned in __free_huge_page,
>   and we bail out in that case before sending the page to e.g: active
>   free list.
>   This gives us full control of the page, and we can handle it
>   page_handle_poison().
> 
>   In other words, we do not allow migrated hugepages to get back to the
>   freelists.
> 
>   Since now the page has no user and has been migrated, we can call
>   dissolve_free_huge_page, which will end up calling update_and_free_page.
>   In update_and_free_page(), we check for the page to be poisoned.
>   If it so, we handle it as we handle gigantic pages, i.e: we break down
>   the page in order-0 pages and free them one by one.
>   Doing so, allows us for free_pages_prepare to skip poisoned pages.
> 
> Because of the way we handle now in-use pages, we no longer need the
> put-as-isolation-migratetype dance, that was guarding for poisoned pages
> to end up in pcplists.

I ran Quan Cai's test program (https://github.com/cailca/linux-mm) on a
small (4GB memory) VM, and weiredly found that (1) the target hugepages
are not always dissolved and (2) dissovled hugetpages are still counted
in "HugePages_Total:". See below:

    $ ./random 1
    - start: migrate_huge_offline
    - use NUMA nodes 0,1.
    - mmap and free 8388608 bytes hugepages on node 0
    - mmap and free 8388608 bytes hugepages on node 1
    madvise: Cannot allocate memory

    $ cat /proc/meminfo
    MemTotal:        4026772 kB
    MemFree:          976300 kB
    MemAvailable:     892840 kB
    Buffers:           20936 kB
    Cached:            99768 kB
    SwapCached:         5904 kB
    Active:            84332 kB
    Inactive:         116328 kB
    Active(anon):      27944 kB
    Inactive(anon):    68524 kB
    Active(file):      56388 kB
    Inactive(file):    47804 kB
    Unevictable:        7532 kB
    Mlocked:               0 kB
    SwapTotal:       2621436 kB
    SwapFree:        2609844 kB
    Dirty:                56 kB
    Writeback:             0 kB
    AnonPages:         81764 kB
    Mapped:            54348 kB
    Shmem:              8948 kB
    KReclaimable:      22744 kB
    Slab:              52056 kB
    SReclaimable:      22744 kB
    SUnreclaim:        29312 kB
    KernelStack:        3888 kB
    PageTables:         2804 kB
    NFS_Unstable:          0 kB
    Bounce:                0 kB
    WritebackTmp:          0 kB
    CommitLimit:     3260612 kB
    Committed_AS:     828196 kB
    VmallocTotal:   34359738367 kB
    VmallocUsed:       19260 kB
    VmallocChunk:          0 kB
    Percpu:             5120 kB
    HardwareCorrupted:  5368 kB
    AnonHugePages:     18432 kB
    ShmemHugePages:        0 kB
    ShmemPmdMapped:        0 kB
    FileHugePages:         0 kB
    FilePmdMapped:         0 kB
    CmaTotal:              0 kB
    CmaFree:               0 kB
    HugePages_Total:    1342     // still counted as hugetlb pages.
    HugePages_Free:        0     // all hugepage are still allocated (or leaked?)
    HugePages_Rsvd:        0
    HugePages_Surp:      762     // some are counted in surplus.
    Hugepagesize:       2048 kB
    Hugetlb:         2748416 kB
    DirectMap4k:      112480 kB
    DirectMap2M:     4081664 kB


    $ page-types -b hwpoison
                 flags      page-count       MB  symbolic-flags                     long-symbolic-flags
    0x0000000000080008             421        1  ___U_______________X_______________________      uptodate,hwpoison
    0x00000000000a8018               1        0  ___UD__________H_G_X_______________________      uptodate,dirty,compound_head,huge,hwpoison
    0x00000000000a801c             920        3  __RUD__________H_G_X_______________________      referenced,uptodate,dirty,compound_head,huge,hwpoison
                 total            1342        5

This means that some hugepages are dissolved, but the others not,
maybe which is not desirable.
I'll dig this more later but just let me share at first.

A few minor comment below ...

> 
> Signed-off-by: Oscar Salvador <osalvador@xxxxxxxx>
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>
> ---
>  include/linux/page-flags.h |  5 ----
>  mm/hugetlb.c               | 60 +++++++++++++++++++++++++++++++++-----
>  mm/memory-failure.c        | 53 +++++++++++++--------------------
>  mm/migrate.c               | 11 ++-----
>  mm/page_alloc.c            | 38 +++++++-----------------
>  5 files changed, 86 insertions(+), 81 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 01baf6d426ff..2ac8bfa0cf20 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -426,13 +426,8 @@ PAGEFLAG(HWPoison, hwpoison, PF_ANY)
>  TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
>  #define __PG_HWPOISON (1UL << PG_hwpoison)
>  extern bool take_page_off_buddy(struct page *page);
> -extern bool set_hwpoison_free_buddy_page(struct page *page);
>  #else
>  PAGEFLAG_FALSE(HWPoison)
> -static inline bool set_hwpoison_free_buddy_page(struct page *page)
> -{
> -	return 0;
> -}
>  #define __PG_HWPOISON 0
>  #endif
>  
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 7badb01d15e3..1c6397936512 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -29,6 +29,7 @@
>  #include <linux/numa.h>
>  #include <linux/llist.h>
>  #include <linux/cma.h>
> +#include <linux/migrate.h>
>  
>  #include <asm/page.h>
>  #include <asm/pgalloc.h>
> @@ -1209,9 +1210,26 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
>  		((node = hstate_next_node_to_free(hs, mask)) || 1);	\
>  		nr_nodes--)
>  
> -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
> -static void destroy_compound_gigantic_page(struct page *page,
> -					unsigned int order)
> +static inline bool PageHugePoisoned(struct page *page)
> +{
> +	if (!PageHuge(page))
> +		return false;
> +
> +	return (unsigned long)page[3].mapping == -1U;
> +}
> +
> +static inline void SetPageHugePoisoned(struct page *page)
> +{
> +	page[3].mapping = (void *)-1U;
> +}
> +
> +static inline void ClearPageHugePoisoned(struct page *page)
> +{
> +	page[3].mapping = NULL;
> +}
> +
> +static void destroy_compound_gigantic_page(struct hstate *h, struct page *page,
> +					   unsigned int order)
>  {
>  	int i;
>  	int nr_pages = 1 << order;
> @@ -1222,14 +1240,19 @@ static void destroy_compound_gigantic_page(struct page *page,
>  		atomic_set(compound_pincount_ptr(page), 0);
>  
>  	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
> +		if (!hstate_is_gigantic(h))
> +			 p->mapping = NULL;
>  		clear_compound_head(p);
>  		set_page_refcounted(p);
>  	}
>  
> +	if (PageHugePoisoned(page))
> +		ClearPageHugePoisoned(page);
>  	set_compound_order(page, 0);
>  	__ClearPageHead(page);
>  }
>  
> +#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
>  static void free_gigantic_page(struct page *page, unsigned int order)
>  {
>  	/*
> @@ -1284,13 +1307,16 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
>  	return NULL;
>  }
>  static inline void free_gigantic_page(struct page *page, unsigned int order) { }
> -static inline void destroy_compound_gigantic_page(struct page *page,
> -						unsigned int order) { }
> +static inline void destroy_compound_gigantic_page(struct hstate *h,
> +						  struct page *page,
> +						  unsigned int order) { }
>  #endif
>  
>  static void update_and_free_page(struct hstate *h, struct page *page)
>  {
>  	int i;
> +	bool poisoned = PageHugePoisoned(page);
> +	unsigned int order = huge_page_order(h);
>  
>  	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
>  		return;
> @@ -1313,11 +1339,21 @@ static void update_and_free_page(struct hstate *h, struct page *page)
>  		 * we might block in free_gigantic_page().
>  		 */
>  		spin_unlock(&hugetlb_lock);
> -		destroy_compound_gigantic_page(page, huge_page_order(h));
> -		free_gigantic_page(page, huge_page_order(h));
> +		destroy_compound_gigantic_page(h, page, order);
> +		free_gigantic_page(page, order);
>  		spin_lock(&hugetlb_lock);
>  	} else {
> -		__free_pages(page, huge_page_order(h));
> +		if (unlikely(poisoned)) {
> +			/*
> +			 * If the hugepage is poisoned, do as we do for
> +			 * gigantic pages and free the pages as order-0.
> +			 * free_pages_prepare will skip over the poisoned ones.
> +			 */
> +			destroy_compound_gigantic_page(h, page, order);

This function is for gigantic page from its name, so shouldn't be called
for non-gigantic huge page. Maybe renaming it and/or introducing some inner
function layer to factor out common part would be better.

> +			free_contig_range(page_to_pfn(page), 1 << order);
> +		} else {
> +			__free_pages(page, huge_page_order(h));
> +		}
>  	}
>  }
>  
> @@ -1427,6 +1463,11 @@ static void __free_huge_page(struct page *page)
>  	if (restore_reserve)
>  		h->resv_huge_pages++;
>  
> +	if (PageHugePoisoned(page)) {
> +		spin_unlock(&hugetlb_lock);
> +		return;
> +	}
> +
>  	if (PageHugeTemporary(page)) {
>  		list_del(&page->lru);
>  		ClearPageHugeTemporary(page);
> @@ -5642,6 +5683,9 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
>  	hugetlb_cgroup_migrate(oldpage, newpage);
>  	set_page_owner_migrate_reason(newpage, reason);
>  
> +	if (reason == MR_MEMORY_FAILURE)
> +		SetPageHugePoisoned(oldpage);
> +
>  	/*
>  	 * transfer temporary state of the new huge page. This is
>  	 * reverse to other transitions because the newpage is going to
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index caf012d34607..c0ebab4eed4c 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -65,9 +65,17 @@ int sysctl_memory_failure_recovery __read_mostly = 1;
>  
>  atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
>  
> -static void page_handle_poison(struct page *page)
> +static void page_handle_poison(struct page *page, bool release, bool set_flag,
> +			       bool huge_flag)
>  {
> -	SetPageHWPoison(page);
> +	if (set_flag)
> +		SetPageHWPoison(page);
> +
> +        if (huge_flag)
> +		dissolve_free_huge_page(page);
> +        else if (release)
> +		put_page(page);
> +

Indentation seems to be broken, you can run checkpatch.pl to find details.

Thanks,
Naoya Horiguchi

>  	page_ref_inc(page);
>  	num_poisoned_pages_inc();
>  }
> @@ -1717,7 +1725,7 @@ static int get_any_page(struct page *page, unsigned long pfn)
>  
>  static int soft_offline_huge_page(struct page *page)
>  {
> -	int ret;
> +	int ret = -EBUSY;
>  	unsigned long pfn = page_to_pfn(page);
>  	struct page *hpage = compound_head(page);
>  	LIST_HEAD(pagelist);
> @@ -1757,19 +1765,12 @@ static int soft_offline_huge_page(struct page *page)
>  			ret = -EIO;
>  	} else {
>  		/*
> -		 * We set PG_hwpoison only when the migration source hugepage
> -		 * was successfully dissolved, because otherwise hwpoisoned
> -		 * hugepage remains on free hugepage list, then userspace will
> -		 * find it as SIGBUS by allocation failure. That's not expected
> -		 * in soft-offlining.
> +		 * At this point the page cannot be in-use since we do not
> +		 * let the page to go back to hugetlb freelists.
> +		 * In that case we just need to dissolve it.
> +		 * page_handle_poison will take care of it.
>  		 */
> -		ret = dissolve_free_huge_page(page);
> -		if (!ret) {
> -			if (set_hwpoison_free_buddy_page(page))
> -				num_poisoned_pages_inc();
> -			else
> -				ret = -EBUSY;
> -		}
> +		page_handle_poison(page, true, true, true);
>  	}
>  	return ret;
>  }
> @@ -1804,10 +1805,8 @@ static int __soft_offline_page(struct page *page)
>  	 * would need to fix isolation locking first.
>  	 */
>  	if (ret == 1) {
> -		put_page(page);
>  		pr_info("soft_offline: %#lx: invalidated\n", pfn);
> -		SetPageHWPoison(page);
> -		num_poisoned_pages_inc();
> +		page_handle_poison(page, true, true, false);
>  		return 0;
>  	}
>  
> @@ -1838,7 +1837,9 @@ static int __soft_offline_page(struct page *page)
>  		list_add(&page->lru, &pagelist);
>  		ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
>  					MIGRATE_SYNC, MR_MEMORY_FAILURE);
> -		if (ret) {
> +		if (!ret) {
> +			page_handle_poison(page, true, true, false);
> +		} else {
>  			if (!list_empty(&pagelist))
>  				putback_movable_pages(&pagelist);
>  
> @@ -1857,37 +1858,25 @@ static int __soft_offline_page(struct page *page)
>  static int soft_offline_in_use_page(struct page *page)
>  {
>  	int ret;
> -	int mt;
>  	struct page *hpage = compound_head(page);
>  
>  	if (!PageHuge(page) && PageTransHuge(hpage))
>  		if (try_to_split_thp_page(page, "soft offline") < 0)
>  			return -EBUSY;
>  
> -	/*
> -	 * Setting MIGRATE_ISOLATE here ensures that the page will be linked
> -	 * to free list immediately (not via pcplist) when released after
> -	 * successful page migration. Otherwise we can't guarantee that the
> -	 * page is really free after put_page() returns, so
> -	 * set_hwpoison_free_buddy_page() highly likely fails.
> -	 */
> -	mt = get_pageblock_migratetype(page);
> -	set_pageblock_migratetype(page, MIGRATE_ISOLATE);
>  	if (PageHuge(page))
>  		ret = soft_offline_huge_page(page);
>  	else
>  		ret = __soft_offline_page(page);
> -	set_pageblock_migratetype(page, mt);
>  	return ret;
>  }
>  
>  static int soft_offline_free_page(struct page *page)
>  {
>  	int rc = -EBUSY;
> -	int rc = dissolve_free_huge_page(page);
>  
>  	if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) {
> -		page_handle_poison(page);
> +		page_handle_poison(page, false, true, false);
>  		rc = 0;
>  	}
>  
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 75c10d81e833..a68d81d0ae6e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1222,16 +1222,11 @@ static int unmap_and_move(new_page_t get_new_page,
>  	 * we want to retry.
>  	 */
>  	if (rc == MIGRATEPAGE_SUCCESS) {
> -		put_page(page);
> -		if (reason == MR_MEMORY_FAILURE) {
> +		if (reason != MR_MEMORY_FAILURE)
>  			/*
> -			 * Set PG_HWPoison on just freed page
> -			 * intentionally. Although it's rather weird,
> -			 * it's how HWPoison flag works at the moment.
> +			 * We handle poisoned pages in page_handle_poison.
>  			 */
> -			if (set_hwpoison_free_buddy_page(page))
> -				num_poisoned_pages_inc();
> -		}
> +			put_page(page);
>  	} else {
>  		if (rc != -EAGAIN) {
>  			if (likely(!__PageMovable(page))) {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4fa0e0887c07..11df51fc2718 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1175,6 +1175,16 @@ static __always_inline bool free_pages_prepare(struct page *page,
>  
>  	trace_mm_page_free(page, order);
>  
> +	if (unlikely(PageHWPoison(page)) && !order) {
> +		/*
> +		 * Untie memcg state and reset page's owner
> +		 */
> +		if (memcg_kmem_enabled() && PageKmemcg(page))
> +			__memcg_kmem_uncharge_page(page, order);
> +		reset_page_owner(page, order);
> +		return false;
> +	}
> +
>  	/*
>  	 * Check tail pages before head page information is cleared to
>  	 * avoid checking PageCompound for order-0 pages.
> @@ -8844,32 +8854,4 @@ bool take_page_off_buddy(struct page *page)
>  	spin_unlock_irqrestore(&zone->lock, flags);
>  	return ret;
>  }
> -
> -/*
> - * Set PG_hwpoison flag if a given page is confirmed to be a free page.  This
> - * test is performed under the zone lock to prevent a race against page
> - * allocation.
> - */
> -bool set_hwpoison_free_buddy_page(struct page *page)
> -{
> -	struct zone *zone = page_zone(page);
> -	unsigned long pfn = page_to_pfn(page);
> -	unsigned long flags;
> -	unsigned int order;
> -	bool hwpoisoned = false;
> -
> -	spin_lock_irqsave(&zone->lock, flags);
> -	for (order = 0; order < MAX_ORDER; order++) {
> -		struct page *page_head = page - (pfn & ((1 << order) - 1));
> -
> -		if (PageBuddy(page_head) && page_order(page_head) >= order) {
> -			if (!TestSetPageHWPoison(page))
> -				hwpoisoned = true;
> -			break;
> -		}
> -	}
> -	spin_unlock_irqrestore(&zone->lock, flags);
> -
> -	return hwpoisoned;
> -}
>  #endif
> -- 
> 2.26.2
> 




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux