Re: [PATCH 1/4] mm: Trial do_wp_page() simplification

Peter Xu <peterx@xxxxxxxxxx> · Mon, 14 Sep 2020 17:15:15 -0400

On Mon, Sep 14, 2020 at 02:34:36PM -0400, Peter Xu wrote:
> On Mon, Sep 14, 2020 at 10:32:11AM -0700, Linus Torvalds wrote:
> > On Mon, Sep 14, 2020 at 7:38 AM Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> > >
> > > I don't have a detailed explanation right now, but this patch appears
> > > to be causing a regression where RDMA subsystem tests fail. Tests
> > > return to normal when this patch is reverted.
> > >
> > > It kind of looks like the process is not seeing DMA'd data to a
> > > pin_user_pages()?
> > 
> > I'm a nincompoop. I actually _talked_ to Hugh Dickins about this when
> > he raised concerns, and I dismissed his concerns with "but PAGE_PIN is
> > special".
> > 
> > As usual, Hugh was right. Page pinning certainly _is_ special, but
> > it's not that different from the regular GUP code.
> > 
> > But in the meantime, I have a lovely confirmation from the kernel test
> > robot, saying that commit 09854ba94c results in a
> > "vm-scalability.throughput 31.4% improvement", which was what I was
> > hoping for - the complexity wasn't just complexity, it was active
> > badness due to the page locking horrors.
> > 
> > I think what we want to do is basically do the "early COW", but only
> > do it for FOLL_PIN (and not turn them into writes for anything but the
> > COW code). So basically redo the "enforced COW mechanism", but rather
> > than do it for everything, now do it only for FOLL_PIN, and only in
> > that COW path.
> > 
> > Peter - any chance you can look at this? I'm still looking at the page
> > lock fairness performance regression, although I now think I have a
> > test patch for Phoronix to test out.
> 
> Sure, I'll try to prepare something like that and share it shortly.

Jason, would you please try the attached patch to see whether it unbreaks the
rdma test?  Thanks!

-- 
Peter Xu
>From 93c534866d2c548cf193a5c17f7058a1f770df5a Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@xxxxxxxxxx>
Date: Mon, 14 Sep 2020 15:34:41 -0400
Subject: [PATCH] mm/gup: Allow enfornced COW for FOLL_PIN

FOLL_PIN may need the enforced COW mechanism as reported by Jason and analyzed
by Linus [1].  This is a continued work based on previous patch [2], however
there's some trivial differences.

Instead of applying enforced COW everywhere, we only apply it for FOLL_PIN to
make sure the pages that were pinned will not be COWed again later on.  In
other words, we'll do early phase COW for pinned page along with the gup
procedure.  And since only FOLL_PIN is affected, we don't need to introduce a
new flag as FOLL_BREAK_COW.  However we'll still need a new fault flag as
FAULT_FLAG_BREAK_COW inside the page fault handler.

Fast gup is not affected by this because it is never used with FOLL_PIN.

Now userfaultfd-wp needs to be ready with COW happening since read gup could
trigger COW now with FOLL_PIN (which will never happen previously).  So when
COW happens we'll need to carry over the uffd-wp bits too if it's there.

Meanwhile, both userfaultfd_pte_wp() and userfaultfd_huge_pmd_wp() need to be
smarter than before on that it needs to return true only if this is a "real"
write fault.  With that extra check, we can identify a real write against an
enforced COW procedure from a FOLL_PIN gup.

Note: hugetlbfs is not considered throughout this patch, because it's missing
some required bits after all (like proper setting of FOLL_COW when page fault
retries).  Considering we may want to unbreak RDMA tests even during the rcs,
this patch only fixes the non-hugetlbfs cases. THPs should still be in count.

[1] https://lore.kernel.org/lkml/20200914143829.GA1424636@xxxxxxxxxx
[2] https://lore.kernel.org/lkml/20200811183950.10603-1-peterx@xxxxxxxxxx

Reported-by: Jason Gunthorpe <jgg@xxxxxxxxxx>
Suggested-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
---
 include/linux/mm.h            |  2 ++
 include/linux/userfaultfd_k.h | 12 ++++++------
 mm/gup.c                      | 17 ++++++++++++-----
 mm/huge_memory.c              | 17 ++++++++++++-----
 mm/memory.c                   | 16 +++++++++-------
 5 files changed, 41 insertions(+), 23 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ca6e6a81576b..741574bfd343 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -416,6 +416,7 @@ extern pgprot_t protection_map[16];
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_BREAK_COW: Do COW explicitly for the fault (even for read).
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -446,6 +447,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_REMOTE			0x80
 #define FAULT_FLAG_INSTRUCTION  		0x100
 #define FAULT_FLAG_INTERRUPTIBLE		0x200
+#define FAULT_FLAG_BREAK_COW			0x400
 
 /*
  * The default fault flags that should be used by most of the
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index a8e5f3ea9bb2..fbcb75daf870 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -62,16 +62,16 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_UFFD_WP;
 }
 
-static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
-				      pte_t pte)
+static inline bool userfaultfd_pte_wp(struct vm_fault *vmf, pte_t pte)
 {
-	return userfaultfd_wp(vma) && pte_uffd_wp(pte);
+	return (vmf->flags & FAULT_FLAG_WRITE) &&
+	    userfaultfd_wp(vmf->vma) && pte_uffd_wp(pte);
 }
 
-static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
-					   pmd_t pmd)
+static inline bool userfaultfd_huge_pmd_wp(struct vm_fault *vmf, pmd_t pmd)
 {
-	return userfaultfd_wp(vma) && pmd_uffd_wp(pmd);
+	return (vmf->flags & FAULT_FLAG_WRITE) &&
+	    userfaultfd_wp(vmf->vma) && pmd_uffd_wp(pmd);
 }
 
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
diff --git a/mm/gup.c b/mm/gup.c
index e5739a1974d5..2011542c4044 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -381,13 +381,13 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 }
 
 /*
- * FOLL_FORCE can write to even unwritable pte's, but only
- * after we've gone through a COW cycle and they are dirty.
+ * FOLL_FORCE (or FOLL_PIN, which requires enforced COW even for read) can
+ * write to even unwritable pte's, but only after we've gone through a COW
+ * cycle and they are dirty.
  */
 static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
 {
-	return pte_write(pte) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+	return pte_write(pte) || ((flags & FOLL_COW) && pte_dirty(pte));
 }
 
 static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -430,7 +430,12 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 	if ((flags & FOLL_NUMA) && pte_protnone(pte))
 		goto no_page;
-	if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
+	/*
+	 * We need the same write permission for FOLL_PIN pages, so that we
+	 * will guarantee to trigger early COW for pinned pages.
+	 */
+	if ((flags & (FOLL_WRITE | FOLL_PIN)) &&
+	    !can_follow_write_pte(pte, flags)) {
 		pte_unmap_unlock(ptep, ptl);
 		return NULL;
 	}
@@ -874,6 +879,8 @@ static int faultin_page(struct vm_area_struct *vma,
 		 */
 		fault_flags |= FAULT_FLAG_TRIED;
 	}
+	if (*flags & FOLL_PIN)
+		fault_flags |= FAULT_FLAG_BREAK_COW;
 
 	ret = handle_mm_fault(vma, address, fault_flags, NULL);
 	if (ret & VM_FAULT_ERROR) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7ff29cc3d55c..e927e968952b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1276,6 +1276,8 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		if (pmd_uffd_wp(orig_pmd))
+			entry = pmd_mkuffd_wp(pmd_wrprotect(entry));
 		if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1))
 			update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
 		unlock_page(page);
@@ -1291,13 +1293,13 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 }
 
 /*
- * FOLL_FORCE can write to even unwritable pmd's, but only
- * after we've gone through a COW cycle and they are dirty.
+ * FOLL_FORCE (or FOLL_PIN, which requires enforced COW even for read) can
+ * write to even unwritable pte's, but only after we've gone through a COW
+ * cycle and they are dirty.
  */
 static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
 {
-	return pmd_write(pmd) ||
-	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+	return pmd_write(pmd) || ((flags & FOLL_COW) && pmd_dirty(pmd));
 }
 
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
@@ -1310,7 +1312,12 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 
 	assert_spin_locked(pmd_lockptr(mm, pmd));
 
-	if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
+	/*
+	 * We need the same write permission for FOLL_PIN pages, so that we
+	 * will guarantee to trigger early COW for pinned pages.
+	 */
+	if ((flags & (FOLL_WRITE | FOLL_PIN)) &&
+	    !can_follow_write_pmd(*pmd, flags))
 		goto out;
 
 	/* Avoid dumping huge zero page */
diff --git a/mm/memory.c b/mm/memory.c
index 469af373ae76..68aadb9c18f5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2719,6 +2719,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		entry = pte_sw_mkyoung(entry);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		if (pte_uffd_wp(vmf->orig_pte))
+			entry = pte_mkuffd_wp(pte_wrprotect(entry));
 		/*
 		 * Clear the pte entry and flush it first, before updating the
 		 * pte with the new entry. This will avoid a race condition
@@ -2912,7 +2914,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 
-	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
+	if (userfaultfd_pte_wp(vmf, *vmf->pte)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		return handle_userfault(vmf, VM_UFFD_WP);
 	}
@@ -3273,7 +3275,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		put_page(swapcache);
 	}
 
-	if (vmf->flags & FAULT_FLAG_WRITE) {
+	if (vmf->flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW)) {
 		ret |= do_wp_page(vmf);
 		if (ret & VM_FAULT_ERROR)
 			ret &= VM_FAULT_ERROR;
@@ -4100,7 +4102,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
 static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
 {
 	if (vma_is_anonymous(vmf->vma)) {
-		if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))
+		if (userfaultfd_huge_pmd_wp(vmf, orig_pmd))
 			return handle_userfault(vmf, VM_UFFD_WP);
 		return do_huge_pmd_wp_page(vmf, orig_pmd);
 	}
@@ -4224,7 +4226,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
 		goto unlock;
 	}
-	if (vmf->flags & FAULT_FLAG_WRITE) {
+	if (vmf->flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW)) {
 		if (!pte_write(entry))
 			return do_wp_page(vmf);
 		entry = pte_mkdirty(entry);
@@ -4267,7 +4269,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		.pgoff = linear_page_index(vma, address),
 		.gfp_mask = __get_fault_gfp_mask(vma),
 	};
-	unsigned int dirty = flags & FAULT_FLAG_WRITE;
+	bool cow = flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW);
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -4294,7 +4296,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 
 			/* NUMA case for anonymous PUDs would go here */
 
-			if (dirty && !pud_write(orig_pud)) {
+			if (cow && !pud_write(orig_pud)) {
 				ret = wp_huge_pud(&vmf, orig_pud);
 				if (!(ret & VM_FAULT_FALLBACK))
 					return ret;
@@ -4332,7 +4334,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
 				return do_huge_pmd_numa_page(&vmf, orig_pmd);
 
-			if (dirty && !pmd_write(orig_pmd)) {
+			if (cow && !pmd_write(orig_pmd)) {
 				ret = wp_huge_pmd(&vmf, orig_pmd);
 				if (!(ret & VM_FAULT_FALLBACK))
 					return ret;
-- 
2.26.2