On Wed, Oct 17, 2018 at 08:23:34AM -0700, Mike Kravetz wrote:
> On 10/17/18 7:30 AM, Jerome Glisse wrote:
> > On Thu, Oct 11, 2018 at 03:42:59PM -0700, Mike Kravetz wrote:
> >> On 10/10/18 11:04 PM, gregkh@xxxxxxxxxxxxxxxxxxx wrote:
> >>
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index 1bceb49aa214..c9209fc69376 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -1324,6 +1324,9 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >>  	pte_t pteval;
> >>  	spinlock_t *ptl;
> >>  	int ret = SWAP_AGAIN;
> >> +	unsigned long sh_address;
> >> +	bool pmd_sharing_possible = false;
> >> +	unsigned long spmd_start, spmd_end;
> >>  	enum ttu_flags flags = (enum ttu_flags)arg;
> >>  
> >>  	/* munlock has nothing to gain from examining un-locked vmas */
> >> @@ -1334,6 +1337,30 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >>  	if (!pte)
> >>  		goto out;
> >>  
> >> +	/*
> >> +	 * Only use the range_start/end mmu notifiers if huge pmd sharing
> >> +	 * is possible.
> >> +	 */
> >> +	if (PageHuge(page)) {
> >> +		spmd_start = address;
> >> +		spmd_end = spmd_start + vma_mmu_pagesize(vma);
> >> +
> >> +		/*
> >> +		 * Check if pmd sharing is possible.  If possible, we could
> >> +		 * unmap a PUD_SIZE range.  spmd_start/spmd_end will be
> >> +		 * modified if sharing is possible.
> >> +		 */
> >> +		adjust_range_if_pmd_sharing_possible(vma, &spmd_start,
> >> +						&spmd_end);
> >> +		if (spmd_end - spmd_start != vma_mmu_pagesize(vma)) {
> >> +			sh_address = address;
> >> +
> >> +			pmd_sharing_possible = true;
> >> +			mmu_notifier_invalidate_range_start(vma->vm_mm,
> >> +						spmd_start, spmd_end);
> >> +		}
> >> +	}
> > 
> > This needs to happen before page_check_address(), as page_check_address()
> > takes the page table spinlock and mmu_notifier_invalidate_range_start() can
> > sleep.
> > 
> > Looking at adjust_range_if_pmd_sharing_possible() and vma_mmu_pagesize(),
> > it seems they can happen before taking the page table lock.
> > 
> > That being said, this is 4.4; I think only the ODP code for InfiniBand would
> > have issues with that.
> > 
> > [Sorry for the late reply, I am back from PTO and still catching up on emails]
> > 
> > Thanks!
> 
> You are correct.  That 'if (PageHuge(page))' block should happen earlier in
> the routine.  The adjust_range_if_pmd_sharing_possible and vma_mmu_pagesize
> calls are just fine there as well.  In fact, that is how I have the code
> structured in an older kernel we run internally.  Not sure why I changed it
> in this version.
> 
> Here is an updated version with this change as well as better comments as
> suggested by Michal.
> 
> From: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> 
> mm: migration: fix migration of huge PMD shared pages
> 
> commit 017b1660df89f5fb4bfe66c34e35f7d2031100c7 upstream
> 
> The page migration code employs try_to_unmap() to try and unmap the
> source page.  This is accomplished by using rmap_walk to find all
> vmas where the page is mapped.  This search stops when page mapcount
> is zero.  For shared PMD huge pages, the page map count is always 1
> no matter the number of mappings.  Shared mappings are tracked via
> the reference count of the PMD page.  Therefore, try_to_unmap stops
> prematurely and does not completely unmap all mappings of the source
> page.
> 
> This problem can result in data corruption as writes to the original
> source page can happen after contents of the page are copied to the
> target page.  Hence, data is lost.
> 
> This problem was originally seen as DB corruption of shared global
> areas after a huge page was soft offlined due to ECC memory errors.
> DB developers noticed they could reproduce the issue by (hotplug)
> offlining memory used to back huge pages.  A simple testcase can
> reproduce the problem by creating a shared PMD mapping (note that
> this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on
> x86)), and using migrate_pages() to migrate process pages between
> nodes while continually writing to the huge pages being migrated.
> 
> To fix, have the try_to_unmap_one routine check for huge PMD sharing
> by calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a
> shared mapping it will be 'unshared', which removes the page table
> entry and drops the reference on the PMD page.  After this, flush
> caches and TLB.
> 
> mmu notifiers are called before locking page tables, but we can not
> be sure of PMD sharing until page tables are locked.  Therefore,
> check for the possibility of PMD sharing before locking so that
> notifiers can prepare for the worst possible case.  The mmu notifier
> calls in this commit are different than upstream.  That is because
> upstream went to a different model here.  Instead of moving to the
> new model, we leave the existing model unchanged and only use the
> mmu_*range* calls in this special case.
> 
> Fixes: 39dde65c9940 ("shared page table for hugetlb page")
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> ---
>  include/linux/hugetlb.h | 14 +++++++++++
>  include/linux/mm.h      |  6 +++++
>  mm/hugetlb.c            | 37 +++++++++++++++++++++++++--
>  mm/rmap.c               | 56 +++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 111 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 685c262e0be8..3957d99e66ea 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -110,6 +110,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
>  			unsigned long addr, unsigned long sz);
>  pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
>  int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
> +void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> +				unsigned long *start, unsigned long *end);
>  struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
>  			      int write);
>  struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> @@ -132,6 +134,18 @@ static inline unsigned long hugetlb_total_pages(void)
>  	return 0;
>  }
>  
> +static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr,
> +					pte_t *ptep)
> +{
> +	return 0;
> +}
> +
> +static inline void adjust_range_if_pmd_sharing_possible(
> +				struct vm_area_struct *vma,
> +				unsigned long *start, unsigned long *end)
> +{
> +}
> +
>  #define follow_hugetlb_page(m,v,p,vs,a,b,i,w)	({ BUG(); 0; })
>  #define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
>  #define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1f4366567e7d..d4e8077fca96 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2058,6 +2058,12 @@ static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
>  	return vma;
>  }
>  
> +static inline bool range_in_vma(struct vm_area_struct *vma,
> +				unsigned long start, unsigned long end)
> +{
> +	return (vma && vma->vm_start <= start && end <= vma->vm_end);
> +}
> +
>  #ifdef CONFIG_MMU
>  pgprot_t vm_get_page_prot(unsigned long vm_flags);
>  void vma_set_page_prot(struct vm_area_struct *vma);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a813b03021b7..279c4d87deeb 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4195,12 +4195,40 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
>  	/*
>  	 * check on proper vm_flags and page table alignment
>  	 */
> -	if (vma->vm_flags & VM_MAYSHARE &&
> -	    vma->vm_start <= base && end <= vma->vm_end)
> +	if (vma->vm_flags & VM_MAYSHARE && range_in_vma(vma, base, end))
>  		return true;
>  	return false;
>  }
>  
> +/*
> + * Determine if start,end range within vma could be mapped by shared pmd.
> + * If yes, adjust start and end to cover range associated with possible
> + * shared pmd mappings.
> + */
> +void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> +				unsigned long *start, unsigned long *end)
> +{
> +	unsigned long check_addr = *start;
> +
> +	if (!(vma->vm_flags & VM_MAYSHARE))
> +		return;
> +
> +	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
> +		unsigned long a_start = check_addr & PUD_MASK;
> +		unsigned long a_end = a_start + PUD_SIZE;
> +
> +		/*
> +		 * If sharing is possible, adjust start/end if necessary.
> +		 */
> +		if (range_in_vma(vma, a_start, a_end)) {
> +			if (a_start < *start)
> +				*start = a_start;
> +			if (a_end > *end)
> +				*end = a_end;
> +		}
> +	}
> +}
> +
>  /*
>   * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
>   * and returns the corresponding pte. While this is not necessary for the
> @@ -4297,6 +4325,11 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
>  {
>  	return 0;
>  }
> +
> +void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> +				unsigned long *start, unsigned long *end)
> +{
> +}
>  #define want_pmd_share()	(0)
>  #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
>  
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 1bceb49aa214..488dda209431 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1324,12 +1324,41 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	pte_t pteval;
>  	spinlock_t *ptl;
>  	int ret = SWAP_AGAIN;
> +	unsigned long sh_address;
> +	bool pmd_sharing_possible = false;
> +	unsigned long spmd_start, spmd_end;
>  	enum ttu_flags flags = (enum ttu_flags)arg;
>  
>  	/* munlock has nothing to gain from examining un-locked vmas */
>  	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
>  		goto out;
>  
> +	/*
> +	 * Only use the range_start/end mmu notifiers if huge pmd sharing
> +	 * is possible.  In the normal case, mmu_notifier_invalidate_page
> +	 * is sufficient as we only unmap a page.  However, if we unshare
> +	 * a pmd, we will unmap a PUD_SIZE range.
> +	 */
> +	if (PageHuge(page)) {
> +		spmd_start = address;
> +		spmd_end = spmd_start + vma_mmu_pagesize(vma);
> +
> +		/*
> +		 * Check if pmd sharing is possible.  If possible, we could
> +		 * unmap a PUD_SIZE range.  spmd_start/spmd_end will be
> +		 * modified if sharing is possible.
> +		 */
> +		adjust_range_if_pmd_sharing_possible(vma, &spmd_start,
> +						&spmd_end);
> +		if (spmd_end - spmd_start != vma_mmu_pagesize(vma)) {
> +			sh_address = address;
> +
> +			pmd_sharing_possible = true;
> +			mmu_notifier_invalidate_range_start(vma->vm_mm,
> +						spmd_start, spmd_end);
> +		}
> +	}
> +
>  	pte = page_check_address(page, mm, address, &ptl, 0);
>  	if (!pte)
>  		goto out;
> @@ -1356,6 +1385,30 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  		}
>  	}
>  
> +	/*
> +	 * Call huge_pmd_unshare to potentially unshare a huge pmd.  Pass
> +	 * sh_address as it will be modified if unsharing is successful.
> +	 */
> +	if (PageHuge(page) && huge_pmd_unshare(mm, &sh_address, pte)) {
> +		/*
> +		 * huge_pmd_unshare unmapped an entire PMD page.  There is
> +		 * no way of knowing exactly which PMDs may be cached for
> +		 * this mm, so flush them all.  spmd_start/spmd_end cover
> +		 * this PUD_SIZE range.
> +		 */
> +		flush_cache_range(vma, spmd_start, spmd_end);
> +		flush_tlb_range(vma, spmd_start, spmd_end);
> +
> +		/*
> +		 * The ref count of the PMD page was dropped which is part
> +		 * of the way map counting is done for shared PMDs.  When
> +		 * there is no other sharing, huge_pmd_unshare returns false
> +		 * and we will unmap the actual page and drop map count
> +		 * to zero.
> +		 */
> +		goto out_unmap;
> +	}
> +
>  	/* Nuke the page table entry. */
>  	flush_cache_page(vma, address, page_to_pfn(page));
>  	if (should_defer_flush(mm, flags)) {
> @@ -1450,6 +1503,9 @@ out_unmap:
>  	if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
>  		mmu_notifier_invalidate_page(mm, address);
>  out:
> +	if (pmd_sharing_possible)
> +		mmu_notifier_invalidate_range_end(vma->vm_mm,
> +						spmd_start, spmd_end);
>  	return ret;
>  }
>  
> -- 
> 2.17.2

Now queued up, thanks.

greg k-h
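
For anyone trying to follow the testcase described in the commit message, a rough userspace sketch is below.  This is only an illustration, not the actual test program referenced above: it assumes x86_64 (PUD_SIZE of 1GB, 2MB default huge pages), two NUMA nodes with enough free huge pages, that mmap() happens to return a PUD_SIZE-aligned address (a real reproducer would force the alignment, e.g. by over-mapping and trimming), and it uses the migrate_pages() wrapper from libnuma's <numaif.h> (link with -lnuma).  The point of the setup is that parent and child share the same hugetlbfs-backed MAP_SHARED mapping, which is the configuration under which the kernel may share the PMD page and the rmap walk in try_to_unmap() stops early.

#define _GNU_SOURCE
#include <numaif.h>		/* migrate_pages(); link with -lnuma */
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAP_SIZE	(1UL << 30)	/* PUD_SIZE on x86_64 */
#define HPAGE_SIZE	(2UL << 20)	/* 2MB huge pages assumed */

int main(void)
{
	unsigned long node0 = 1UL << 0, node1 = 1UL << 1;
	unsigned long off;
	char *p;
	pid_t pid;
	int i;

	/* Shared hugetlb mapping; VM_MAYSHARE is what makes pmd sharing possible. */
	p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Fault the whole range in the parent so its page tables are populated. */
	for (off = 0; off < MAP_SIZE; off += HPAGE_SIZE)
		p[off] = 1;

	pid = fork();
	if (pid == 0) {
		/* Child: keep writing so pages are dirty while they are migrated. */
		for (;;)
			for (off = 0; off < MAP_SIZE; off += HPAGE_SIZE)
				p[off]++;
	}

	/* Parent: bounce the child's pages between NUMA nodes 0 and 1. */
	for (i = 0; i < 100; i++) {
		if (migrate_pages(pid, 8 * sizeof(node0), &node0, &node1) < 0)
			perror("migrate_pages 0->1");
		if (migrate_pages(pid, 8 * sizeof(node0), &node1, &node0) < 0)
			perror("migrate_pages 1->0");
	}

	/* With the bug, writes from the child can land in the old page and be lost. */
	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);
	munmap(p, MAP_SIZE);
	return 0;
}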