On Wed, Oct 17, 2018 at 08:23:34AM -0700, Mike Kravetz wrote:
> On 10/17/18 7:30 AM, Jerome Glisse wrote:
> > On Thu, Oct 11, 2018 at 03:42:59PM -0700, Mike Kravetz wrote:
> >> On 10/10/18 11:04 PM, gregkh@xxxxxxxxxxxxxxxxxxx wrote:
> >>
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index 1bceb49aa214..c9209fc69376 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -1324,6 +1324,9 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >>  	pte_t pteval;
> >>  	spinlock_t *ptl;
> >>  	int ret = SWAP_AGAIN;
> >> +	unsigned long sh_address;
> >> +	bool pmd_sharing_possible = false;
> >> +	unsigned long spmd_start, spmd_end;
> >>  	enum ttu_flags flags = (enum ttu_flags)arg;
> >>  
> >>  	/* munlock has nothing to gain from examining un-locked vmas */
> >> @@ -1334,6 +1337,30 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >>  	if (!pte)
> >>  		goto out;
> >>  
> >> +	/*
> >> +	 * Only use the range_start/end mmu notifiers if huge pmd sharing
> >> +	 * is possible.
> >> +	 */
> >> +	if (PageHuge(page)) {
> >> +		spmd_start = address;
> >> +		spmd_end = spmd_start + vma_mmu_pagesize(vma);
> >> +
> >> +		/*
> >> +		 * Check if pmd sharing is possible.  If possible, we could
> >> +		 * unmap a PUD_SIZE range.  spmd_start/spmd_end will be
> >> +		 * modified if sharing is possible.
> >> +		 */
> >> +		adjust_range_if_pmd_sharing_possible(vma, &spmd_start,
> >> +						&spmd_end);
> >> +		if (spmd_end - spmd_start != vma_mmu_pagesize(vma)) {
> >> +			sh_address = address;
> >> +
> >> +			pmd_sharing_possible = true;
> >> +			mmu_notifier_invalidate_range_start(vma->vm_mm,
> >> +						spmd_start, spmd_end);
> >> +		}
> >> +	}
> > 
> > This needs to happen before page_check_address(), as page_check_address()
> > takes the page table spinlock and mmu_notifier_invalidate_range_start() can
> > sleep.
> > 
> > Looking at adjust_range_if_pmd_sharing_possible() and vma_mmu_pagesize(),
> > it seems they can happen before taking the page table lock.
> > 
> > That being said, this is 4.4; I think only the ODP code for InfiniBand would
> > have issues with that.
> > 
> > [Sorry for the late reply, I am back from PTO and still catching up on emails]
> > 
> > Thanks!
> 
> You are correct.  That 'if (PageHuge(page))' block should happen earlier in
> the routine.  The adjust_range_if_pmd_sharing_possible and vma_mmu_pagesize
> calls are just fine there as well.  In fact, that is how I have the code
> structured in an older kernel we run internally.  Not sure why I changed it
> in this version.
> 
> Here is an updated version with this change as well as better comments as
> suggested by Michal.
> 
> From: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> 
> mm: migration: fix migration of huge PMD shared pages
> 
> commit 017b1660df89f5fb4bfe66c34e35f7d2031100c7 upstream
> 
> The page migration code employs try_to_unmap() to try and unmap the
> source page.  This is accomplished by using rmap_walk to find all
> vmas where the page is mapped.  This search stops when page mapcount
> is zero.  For shared PMD huge pages, the page map count is always 1
> no matter the number of mappings.  Shared mappings are tracked via
> the reference count of the PMD page.  Therefore, try_to_unmap stops
> prematurely and does not completely unmap all mappings of the source
> page.
> 
> This problem can result in data corruption as writes to the original
> source page can happen after contents of the page are copied to the
> target page.  Hence, data is lost.
> 
> This problem was originally seen as DB corruption of shared global
> areas after a huge page was soft offlined due to ECC memory errors.
> DB developers noticed they could reproduce the issue by (hotplug)
> offlining memory used to back huge pages.  A simple testcase can
> reproduce the problem by creating a shared PMD mapping (note that
> this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on
> x86)), and using migrate_pages() to migrate process pages between
> nodes while continually writing to the huge pages being migrated.
> 
> To fix, have the try_to_unmap_one routine check for huge PMD sharing
> by calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a
> shared mapping it will be 'unshared', which removes the page table
> entry and drops the reference on the PMD page.  After this, flush
> caches and TLB.
> 
> mmu notifiers are called before locking page tables, but we can not
> be sure of PMD sharing until page tables are locked.  Therefore,
> check for the possibility of PMD sharing before locking so that
> notifiers can prepare for the worst possible case.  The mmu notifier
> calls in this commit are different than upstream.  That is because
> upstream went to a different model here.  Instead of moving to the
> new model, we leave the existing model unchanged and only use the
> mmu_*range* calls in this special case.
> 
> Fixes: 39dde65c9940 ("shared page table for hugetlb page")
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> ---
>  include/linux/hugetlb.h | 14 +++++++++++
>  include/linux/mm.h      |  6 +++++
>  mm/hugetlb.c            | 37 +++++++++++++++++++++++++--
>  mm/rmap.c               | 56 +++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 111 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 685c262e0be8..3957d99e66ea 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -110,6 +110,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
>  			unsigned long addr, unsigned long sz);
>  pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
>  int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
> +void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> +				unsigned long *start, unsigned long *end);
>  struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
>  			      int write);
>  struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> @@ -132,6 +134,18 @@ static inline unsigned long hugetlb_total_pages(void)
>  	return 0;
>  }
>  
> +static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr,
> +					pte_t *ptep)
> +{
> +	return 0;
> +}
> +
> +static inline void adjust_range_if_pmd_sharing_possible(
> +				struct vm_area_struct *vma,
> +				unsigned long *start, unsigned long *end)
> +{
> +}
> +
>  #define follow_hugetlb_page(m,v,p,vs,a,b,i,w)	({ BUG(); 0; })
>  #define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
>  #define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1f4366567e7d..d4e8077fca96 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2058,6 +2058,12 @@ static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
>  	return vma;
>  }
>  
> +static inline bool range_in_vma(struct vm_area_struct *vma,
> +				unsigned long start, unsigned long end)
> +{
> +	return (vma && vma->vm_start <= start && end <= vma->vm_end);
> +}
> +
>  #ifdef CONFIG_MMU
>  pgprot_t vm_get_page_prot(unsigned long vm_flags);
>  void vma_set_page_prot(struct vm_area_struct *vma);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a813b03021b7..279c4d87deeb 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4195,12 +4195,40 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
>  	/*
>  	 * check on proper vm_flags and page table alignment
>  	 */
> -	if (vma->vm_flags & VM_MAYSHARE &&
> -	    vma->vm_start <= base && end <= vma->vm_end)
> +	if (vma->vm_flags & VM_MAYSHARE && range_in_vma(vma, base, end))
>  		return true;
>  	return false;
>  }
>  
> +/*
> + * Determine if start,end range within vma could be mapped by shared pmd.
> + * If yes, adjust start and end to cover range associated with possible
> + * shared pmd mappings.
> + */
> +void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> +				unsigned long *start, unsigned long *end)
> +{
> +	unsigned long check_addr = *start;
> +
> +	if (!(vma->vm_flags & VM_MAYSHARE))
> +		return;
> +
> +	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
> +		unsigned long a_start = check_addr & PUD_MASK;
> +		unsigned long a_end = a_start + PUD_SIZE;
> +
> +		/*
> +		 * If sharing is possible, adjust start/end if necessary.
> +		 */
> +		if (range_in_vma(vma, a_start, a_end)) {
> +			if (a_start < *start)
> +				*start = a_start;
> +			if (a_end > *end)
> +				*end = a_end;
> +		}
> +	}
> +}
> +
>  /*
>   * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
>   * and returns the corresponding pte. While this is not necessary for the
> @@ -4297,6 +4325,11 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
>  {
>  	return 0;
>  }
> +
> +void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> +				unsigned long *start, unsigned long *end)
> +{
> +}
>  #define want_pmd_share()	(0)
>  #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
>  
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 1bceb49aa214..488dda209431 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1324,12 +1324,41 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	pte_t pteval;
>  	spinlock_t *ptl;
>  	int ret = SWAP_AGAIN;
> +	unsigned long sh_address;
> +	bool pmd_sharing_possible = false;
> +	unsigned long spmd_start, spmd_end;
>  	enum ttu_flags flags = (enum ttu_flags)arg;
>  
>  	/* munlock has nothing to gain from examining un-locked vmas */
>  	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
>  		goto out;
>  
> +	/*
> +	 * Only use the range_start/end mmu notifiers if huge pmd sharing
> +	 * is possible.  In the normal case, mmu_notifier_invalidate_page
> +	 * is sufficient as we only unmap a page.  However, if we unshare
> +	 * a pmd, we will unmap a PUD_SIZE range.
> +	 */
> +	if (PageHuge(page)) {
> +		spmd_start = address;
> +		spmd_end = spmd_start + vma_mmu_pagesize(vma);
> +
> +		/*
> +		 * Check if pmd sharing is possible.  If possible, we could
> +		 * unmap a PUD_SIZE range.  spmd_start/spmd_end will be
> +		 * modified if sharing is possible.
> +		 */
> +		adjust_range_if_pmd_sharing_possible(vma, &spmd_start,
> +						&spmd_end);
> +		if (spmd_end - spmd_start != vma_mmu_pagesize(vma)) {
> +			sh_address = address;
> +
> +			pmd_sharing_possible = true;
> +			mmu_notifier_invalidate_range_start(vma->vm_mm,
> +						spmd_start, spmd_end);
> +		}
> +	}
> +
>  	pte = page_check_address(page, mm, address, &ptl, 0);
>  	if (!pte)
>  		goto out;
> @@ -1356,6 +1385,30 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  		}
>  	}
>  
> +	/*
> +	 * Call huge_pmd_unshare to potentially unshare a huge pmd.  Pass
> +	 * sh_address as it will be modified if unsharing is successful.
> +	 */
> +	if (PageHuge(page) && huge_pmd_unshare(mm, &sh_address, pte)) {
> +		/*
> +		 * huge_pmd_unshare unmapped an entire PMD page.  There is
> +		 * no way of knowing exactly which PMDs may be cached for
> +		 * this mm, so flush them all.  spmd_start/spmd_end cover
> +		 * this PUD_SIZE range.
> +		 */
> +		flush_cache_range(vma, spmd_start, spmd_end);
> +		flush_tlb_range(vma, spmd_start, spmd_end);
> +
> +		/*
> +		 * The ref count of the PMD page was dropped which is part
> +		 * of the way map counting is done for shared PMDs.  When
> +		 * there is no other sharing, huge_pmd_unshare returns false
> +		 * and we will unmap the actual page and drop map count
> +		 * to zero.
> +		 */
> +		goto out_unmap;
> +	}
> +
>  	/* Nuke the page table entry. */
>  	flush_cache_page(vma, address, page_to_pfn(page));
>  	if (should_defer_flush(mm, flags)) {
> @@ -1450,6 +1503,9 @@ out_unmap:
>  	if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
>  		mmu_notifier_invalidate_page(mm, address);
>  out:
> +	if (pmd_sharing_possible)
> +		mmu_notifier_invalidate_range_end(vma->vm_mm,
> +						spmd_start, spmd_end);
>  	return ret;
>  }
>  
> -- 
> 2.17.2

Now queued up, thanks.

greg k-h
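
For anyone trying to follow the testcase described in the commit message, a rough userspace sketch is below.  This is only an illustration, not the actual test program referenced above: it assumes x86_64 (PUD_SIZE of 1GB, 2MB default huge pages), two NUMA nodes with enough free huge pages, that mmap() happens to return a PUD_SIZE-aligned address (a real reproducer would force the alignment, e.g. by over-mapping and trimming), and it uses the migrate_pages() wrapper from libnuma's <numaif.h> (link with -lnuma).  The point of the setup is that parent and child share the same hugetlbfs-backed MAP_SHARED mapping, which is the configuration under which the kernel may share the PMD page and the rmap walk in try_to_unmap() stops early.

#define _GNU_SOURCE
#include <numaif.h>		/* migrate_pages(); link with -lnuma */
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAP_SIZE	(1UL << 30)	/* PUD_SIZE on x86_64 */
#define HPAGE_SIZE	(2UL << 20)	/* 2MB huge pages assumed */

int main(void)
{
	unsigned long node0 = 1UL << 0, node1 = 1UL << 1;
	unsigned long off;
	char *p;
	pid_t pid;
	int i;

	/* Shared hugetlb mapping; VM_MAYSHARE is what makes pmd sharing possible. */
	p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Fault the whole range in the parent so its page tables are populated. */
	for (off = 0; off < MAP_SIZE; off += HPAGE_SIZE)
		p[off] = 1;

	pid = fork();
	if (pid == 0) {
		/* Child: keep writing so pages are dirty while they are migrated. */
		for (;;)
			for (off = 0; off < MAP_SIZE; off += HPAGE_SIZE)
				p[off]++;
	}

	/* Parent: bounce the child's pages between NUMA nodes 0 and 1. */
	for (i = 0; i < 100; i++) {
		if (migrate_pages(pid, 8 * sizeof(node0), &node0, &node1) < 0)
			perror("migrate_pages 0->1");
		if (migrate_pages(pid, 8 * sizeof(node0), &node1, &node0) < 0)
			perror("migrate_pages 1->0");
	}

	/* With the bug, writes from the child can land in the old page and be lost. */
	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);
	munmap(p, MAP_SIZE);
	return 0;
}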