Re: [patch 128/192] mm: improve mprotect(R|W) efficiency on pages referenced once

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Jun 29, 2021 at 07:25:42PM -0700, Linus Torvalds wrote:
> On Tue, Jun 29, 2021 at 6:39 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > And since MM_CP_DIRTY_ACCT implies "VM_WRITE|VM_SHARED" all set, above should
> > be a slightly faster version of below:
> 
> That's way too subtle, particularly since the MM_CP_DIRTY_ACCT logic
> comes from another file entirely.
> 
> I don't think it's even faster, considering that presumably the
> anonymous mapping case is the common one, and that's the one that
> needs all the extra tests, it's likely better to _not_ test that very
> subtle flag at all, and just doing the straightforward and obvious
> tests that are understandable _locally_.
> 
> So I claim that it's
> 
>  (a) not an optimization at all
> 
>  (b) completely locally unintuitive and unreadable
> 
> > Again, I think in all cases some more comment should be good indeed..
> 
> I really want more than a comment. I want that MM_CP_DIRTY_ACCT bit
> testing gone.

My understanding is that MM_CP_DIRTY_ACCT contains all check results from
vma_wants_writenotify(), so if we drop it we'd need to have something like that
to be checked within change_pte_range(), which is again slower (I have totally
no idea how slow to check vma->vm_flags & VM_WRITE, but moving the whole
vma_wants_writenotify here is definitely even slower).

> 
> The only point where it makes sense to check MM_CP_DIRTY_ACCT is
> within the context of "is the page already dirty".
> 
> So I think the logic should be something along the lines of
> 
>  - first:
> 
>          if (!(vma->vm_flags & VM_WRITE))
>                 return false;
> 
>    because that logic is set in stone, and true regardless of anything
> else. If the vma isn't writable, we're not going to set the write bit.
> End of story.
> 
>  - then, check the vma_is_anonumous() case:
> 
>         if (vma_is_anonymous(vma))
>                 return page_count(pte_page(pte)) == 1;
> 
>      because if it's a writable mapping, and anonymous, then we can
> mark it writable if we're the exclusive owners of that page.

Shouldn't we still at least checks [soft-]dirty bits and uffd-wp bits to make
sure it's either not dirty tracked or uffd wr-protected?  Say, IMHO it's
possible that soft-dirty tracking enabled on this anonymous vma range, then we
still depend on the write bit removed to set the soft-dirty later in the fault
handler.

> 
>  - and THEN we can handle the "ok, shared mapping, now let's start
> thinking about dirty accounting" cases.
> 
> Make it obvious and correct. This is not a sequence where you should
> try to (incorrectly) optimize away individual instructions.

Yes I still fully agree it's very un-obvious.  So far the best thing I can come
up with is something like below (patch attached too but not yet tested). I
moved VM_WRITE out so hopefully it'll be very clear; then I also rearranged the
checks so the final outcome looks like below:

static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
				  unsigned long cp_flags)
{
	/*
	 * It is unclear whether this optimization can be done safely for NUMA
	 * pages.
	 */
	if (cp_flags & MM_CP_PROT_NUMA)
		return false;

	/*
	 * Never apply write bit if VM_WRITE not set.  Note that this is
	 * actually checked for VM_SHARED when MM_CP_DIRTY_ACCT is set, so
	 * logically we only need to check it for !MM_CP_DIRTY_ACCT, but just
	 * make it even more obvious.
	 */
	if (!(vma->vm_flags & VM_WRITE))
		return false;

	/*
	 * Don't do this optimization for clean pages as we need to be notified
	 * of the transition from clean to dirty.
	 */
	if (!pte_dirty(pte))
		return false;

	/* Same for softdirty. */
	if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
		return false;

	/*
	 * For userfaultfd the user program needs to monitor write faults so we
	 * can't do this optimization.
	 */
	if (pte_uffd_wp(pte))
		return false;

	/*
	 * MM_CP_DIRTY_ACCT indicates that we can always make the page writable
	 * regardless of the number of references.  Time to set the write bit.
	 */
	if (cp_flags & MM_CP_DIRTY_ACCT)
		return true;

	/*
	 * Othewise it means !MM_CP_DIRTY_ACCT.  We can only apply write bit
	 * early if it's anonymous page and we exclusively own it.
	 */
	if (vma_is_anonymous(vma) && (page_count(pte_page(pte)) == 1))
		return true;

	/* Don't play any trick */
	return false;
}

The logic should be the same as before, it's just that we'll do an extra check
on VM_WRITE for MM_CP_DIRTY_ACCT but assuming it's ok.

Another side note is that I still think the VM_SOFTDIRTY check is wrong in
may_avoid_write_fault() and even in the old code (I mentioned it previously
when reviewing the patch), as !VM_SOFTDIRTY should mean soft dirty tracking
enabled while VM_SOFTDIRTY means disabled.  So I wonder whether it should be:

-       if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
+       if (!pte_soft_dirty(pte) && !(vma->vm_flags & VM_SOFTDIRTY))

However I didn't touch it up there as it may need more justifications (I feel
it's okay in the old code, as vma_wants_writenotify actually checks it too and
in the right way; however after the anonymous fast path it seems to prone to
error if it's anonymous; I'll check later).

Thanks,

-- 
Peter Xu
>From 4fb32ad7c949d5ec6b6ea364d3388b50bf674c9c Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@xxxxxxxxxx>
Date: Wed, 30 Jun 2021 12:20:12 -0400
Subject: [PATCH] mm/mprotect: Optimize layout of may_avoid_write_fault()

Firstly move VM_WRITE check to be outside of !MM_CP_DIRTY_ACCT chunk, so as to
make it clear that we won't accidentally set the write bit to !VM_WRITE vmas.

The old logic is hard to read in that it was written in reversed logic.  Put
things backward by moving the soft-dirty and uffd-wp checks earlier.  Make the
NUMA check even earlier than those as it's a cheap check and straightforward.

Make the only "return true" case to be either the MM_CP_DIRTY_ACCT (which
stands for the VM_SHARED cases when write bit can be applied), or the special
anonymous page when we exclusively own it.

Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
---
 mm/mprotect.c | 39 +++++++++++++++++++++++++--------------
 1 file changed, 25 insertions(+), 14 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 4cb240fd9936..3977bfd55f62 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -40,17 +40,20 @@ static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
 				  unsigned long cp_flags)
 {
 	/*
-	 * The dirty accountable bit indicates that we can always make the page
-	 * writable regardless of the number of references.
+	 * It is unclear whether this optimization can be done safely for NUMA
+	 * pages.
 	 */
-	if (!(cp_flags & MM_CP_DIRTY_ACCT)) {
-		/* Otherwise, we must have exclusive access to the page. */
-		if (!(vma_is_anonymous(vma) && (vma->vm_flags & VM_WRITE)))
-			return false;
+	if (cp_flags & MM_CP_PROT_NUMA)
+		return false;
 
-		if (page_count(pte_page(pte)) != 1)
-			return false;
-	}
+	/*
+	 * Never apply write bit if VM_WRITE not set.  Note that this is
+	 * actually checked for VM_SHARED when MM_CP_DIRTY_ACCT is set, so
+	 * logically we only need to check it for !MM_CP_DIRTY_ACCT, but just
+	 * make it even more obvious.
+	 */
+	if (!(vma->vm_flags & VM_WRITE))
+		return false;
 
 	/*
 	 * Don't do this optimization for clean pages as we need to be notified
@@ -71,13 +74,21 @@ static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
 		return false;
 
 	/*
-	 * It is unclear whether this optimization can be done safely for NUMA
-	 * pages.
+	 * MM_CP_DIRTY_ACCT indicates that we can always make the page writable
+	 * regardless of the number of references.  Time to set the write bit.
 	 */
-	if (cp_flags & MM_CP_PROT_NUMA)
-		return false;
+	if (cp_flags & MM_CP_DIRTY_ACCT)
+		return true;
+
+	/*
+	 * Othewise it means !MM_CP_DIRTY_ACCT.  We can only apply write bit
+	 * early if it's anonymous page and we exclusively own it.
+	 */
+	if (vma_is_anonymous(vma) && (page_count(pte_page(pte)) == 1))
+		return true;
 
-	return true;
+	/* Don't play any trick */
+	return false;
 }
 
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
-- 
2.31.1


[Index of Archives]     [Kernel Archive]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]

  Powered by Linux