Re: [patch 128/192] mm: improve mprotect(R|W) efficiency on pages referenced once

On Wed, Jun 30, 2021 at 11:03:25AM -0700, Linus Torvalds wrote:
> On Wed, Jun 30, 2021 at 9:42 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > Yes I still fully agree it's very un-obvious.  So far the best thing I can come
> > up with is something like below (patch attached too but not yet tested). I
> > moved VM_WRITE out so hopefully it'll be very clear; then I also rearranged the
> > checks so the final outcome looks like below:
> >
> > static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
> >                                   unsigned long cp_flags)
> > {
> >         /*
> >          * It is unclear whether this optimization can be done safely for NUMA
> >          * pages.
> >          */
> >         if (cp_flags & MM_CP_PROT_NUMA)
> >                 return false;
> 
> Please just put that VM_WRITE test first. It's the one that really
> *matters*. There's no "it's unclear if" about that part. Just handle
> the obvious and important check first.
> 
> Yeah, yeah, they both return false, so order doesn't matter from a
> semantic standpoint, but from a clarity standpoint just do the clear
> and unambiguous and security-relevant test first.
> 
> The rest of the tests are implementation details, the VM_WRITE test is
> fundamental behavior. It's the one that made me worry about this patch
> in the first place.

Sure.

> 
> >         /*
> >          * Don't do this optimization for clean pages as we need to be notified
> >          * of the transition from clean to dirty.
> >          */
> >         if (!pte_dirty(pte))
> >                 return false;
> >
> >         /* Same for softdirty. */
> >         if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
> >                 return false;
> >
> >         /*
> >          * For userfaultfd the user program needs to monitor write faults so we
> >          * can't do this optimization.
> >          */
> >         if (pte_uffd_wp(pte))
> >                 return false;
> 
> So all of these are a bit special.
> 
> Why? Because if I look at the actual page fault path, these are not
> the tests there.
> 
> I'd really like to have some obvious situation where we keep this
> "make it writable" in sync with what would actually happen on a write
> fault when it's not writable.
> 
> And it's not at all obvious to me for these cases.
> 
> The do_wp_page() code doesn't even use pte_uffd_wp(). It uses
> userfaultfd_pte_wp(vma, pte), and I don't even know why. Yes, I can
> see the code (it additionally tests the VM_UFFD_WP flag in the vma),
> but a number of other paths then only do that pte_uffd_wp() test.

The vma check is a safety net for the case where the vma has already been
unregistered from uffd-wp but some ptes still carry a leftover uffd-wp bit.
Currently UFFDIO_UNREGISTER is lazy about clearing uffd-wp bits that were
applied to ptes, so even after a vma is unregistered, pte_uffd_wp() can still
return true for some of its ptes.  The vma check makes sure that when this
happens the stale uffd-wp bit will be auto-removed.
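
To match do_wp_page() on this point, the check in may_avoid_write_fault()
could use the same helper instead of the bare pte bit test, something like
below (untested sketch; userfaultfd_pte_wp() is the existing helper that
additionally consults VM_UFFD_WP on the vma):

        /*
         * For userfaultfd the user program needs to monitor write faults so
         * we can't do this optimization.  Use the same helper as do_wp_page()
         * so that the vma registration state is consulted as well.
         */
        if (userfaultfd_pte_wp(vma, pte))
                return false;

Whether that's actually desirable here is part of the question though: with
the vma check included, a leftover uffd-wp bit on an unregistered vma would no
longer block the optimization.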

> 
> I get the feeling that we really should try to match what the
> do_wp_page() path does, though.

Makes sense.

> 
> Which brings up another issue: the do_wp_page() path treats PageKsm()
> pages differently. And it locks the page before double-checking the
> page count.
> 
> Why does mprotect() not need to do the same thing? I think this has
> come up before, and "change_protection()" can get called with the
> mmap_sem held just for reading - see userfaultfd - so it has all the
> same issues as a page fault does, afaik.

Good point..  I overlooked ksm when reviewing the patch; I really should have
looked at do_wp_page() as you suggested (hmm.. the truth is I wasn't even
aware of this patch and never planned to review it, until its initial versions
broke the uffd-wp anonymous tests while I was testing mmots...).

Maybe something like this (to be squashed into the previously attached patch):

---8<---
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 3977bfd55f62..7aab30ac9c9f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -39,12 +39,8 @@
 static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
                                  unsigned long cp_flags)
 {
-       /*
-        * It is unclear whether this optimization can be done safely for NUMA
-        * pages.
-        */
-       if (cp_flags & MM_CP_PROT_NUMA)
-               return false;
+       struct page *page;
+       bool ret = false;

        /*
         * Never apply write bit if VM_WRITE not set.  Note that this is
@@ -55,6 +51,13 @@ static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
        if (!(vma->vm_flags & VM_WRITE))
                return false;

+       /*
+        * It is unclear whether this optimization can be done safely for NUMA
+        * pages.
+        */
+       if (cp_flags & MM_CP_PROT_NUMA)
+               return false;
+
        /*
         * Don't do this optimization for clean pages as we need to be notified
         * of the transition from clean to dirty.
@@ -80,15 +83,22 @@ static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
        if (cp_flags & MM_CP_DIRTY_ACCT)
                return true;

+       page = pte_page(pte);
+       /* Best effort to take the page lock; don't play tricks if that fails */
+       if (!trylock_page(page))
+               return false;
+       /* KSM pages need COW; leave them be */
+       if (PageKsm(page))
+               goto unlock_fail;
        /*
-        * Othewise it means !MM_CP_DIRTY_ACCT.  We can only apply write bit
-        * early if it's anonymous page and we exclusively own it.
+        * Otherwise it means !MM_CP_DIRTY_ACCT and !KSM.  We can only apply
+        * the write bit early if it's an anonymous page we exclusively own.
         */
-       if (vma_is_anonymous(vma) && (page_count(pte_page(pte)) == 1))
-               return true;
-
-       /* Don't play any trick */
-       return false;
+       if (vma_is_anonymous(vma) && (page_count(page) == 1))
+               ret = true;
+unlock_fail:
+       unlock_page(page);
+       return ret;
 }

 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
---8<---
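
For reference, with the above squashed in, the whole helper would read roughly
like below (untested; the VM_WRITE comment is cut short because only its first
line shows up in the hunk context here):

static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
                                  unsigned long cp_flags)
{
        struct page *page;
        bool ret = false;

        /* Never apply write bit if VM_WRITE not set.  Note that this is ... */
        if (!(vma->vm_flags & VM_WRITE))
                return false;

        /*
         * It is unclear whether this optimization can be done safely for NUMA
         * pages.
         */
        if (cp_flags & MM_CP_PROT_NUMA)
                return false;

        /*
         * Don't do this optimization for clean pages as we need to be notified
         * of the transition from clean to dirty.
         */
        if (!pte_dirty(pte))
                return false;

        /* Same for softdirty. */
        if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
                return false;

        /*
         * For userfaultfd the user program needs to monitor write faults so we
         * can't do this optimization.
         */
        if (pte_uffd_wp(pte))
                return false;

        /*
         * MM_CP_DIRTY_ACCT indicates that we can always make the page writable
         * regardless of the number of references.  Time to set the write bit.
         */
        if (cp_flags & MM_CP_DIRTY_ACCT)
                return true;

        page = pte_page(pte);
        /* Best effort to take the page lock; don't play tricks if that fails */
        if (!trylock_page(page))
                return false;
        /* KSM pages need COW; leave them be */
        if (PageKsm(page))
                goto unlock_fail;
        /*
         * Otherwise it means !MM_CP_DIRTY_ACCT and !KSM.  We can only apply
         * the write bit early if it's an anonymous page we exclusively own.
         */
        if (vma_is_anonymous(vma) && (page_count(page) == 1))
                ret = true;
unlock_fail:
        unlock_page(page);
        return ret;
}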

I hope I didn't overlook anything else..

Today while looking at the ksm code I got lost on why we say "PageKsm()
doesn't necessarily raise the page refcount", as in do_wp_page().  I was
looking at replace_page() where, afaict, we still do proper refcounting for
the stable nodes with "get_page(kpage)".

I know I must have missed something but I can't quickly tell what.  In any
case, with the PageKsm check above I think it'll be safe, and it's definitely
clear that the page lock will stabilize PageKsm()'s return value, just like in
do_wp_page().
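
For comparison, the anonymous reuse path in do_wp_page() that we're trying to
stay in sync with goes roughly like below (paraphrased from memory, so take it
as a sketch of the structure rather than the exact code):

        if (PageAnon(vmf->page)) {
                struct page *page = vmf->page;

                /* PageKsm() doesn't necessarily raise the page refcount */
                if (PageKsm(page) || page_count(page) != 1)
                        goto copy;
                if (!trylock_page(page))
                        goto copy;
                if (PageKsm(page) || page_mapcount(page) != 1 ||
                    page_count(page) != 1) {
                        unlock_page(page);
                        goto copy;
                }
                /* We're the only user: reuse the page writable, no COW */
                unlock_page(page);
                wp_page_reuse(vmf);
                return VM_FAULT_WRITE;
        }

One difference (if I'm remembering the code right) is that do_wp_page() also
re-checks page_mapcount() together with page_count() under the page lock; I
didn't add that to may_avoid_write_fault(), but maybe it's worth considering.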

> 
> >         /*
> >          * MM_CP_DIRTY_ACCT indicates that we can always make the page writable
> >          * regardless of the number of references.  Time to set the write bit.
> >          */
> >         if (cp_flags & MM_CP_DIRTY_ACCT)
> >                 return true;
> >
> >         /*
> >          * Othewise it means !MM_CP_DIRTY_ACCT.  We can only apply write bit
> >          * early if it's anonymous page and we exclusively own it.
> >          */
> >         if (vma_is_anonymous(vma) && (page_count(pte_page(pte)) == 1))
> >                 return true;
> >
> >         /* Don't play any trick */
> >         return false;
> > }
> >
> > The logic should be the same as before; it's just that we'll do an extra check
> > on VM_WRITE for MM_CP_DIRTY_ACCT, which I assume is ok.
> 
> See above. I don't think the logic before was all that clear either.
> 
> The one case that is clear is that if it's a shared mapping, and
> MM_CP_DIRTY_ACCT is set, and it was already dirty (and softdirty),
> then it's ok.
> 
> That's the old code.  I don't like how the old code was written
> (because I think that MM_CP_DIRTY_ACCT bit was too subtle), but I
> think the old code was at least correct.
> 
> The new code, it just worries me. It adds all those new cases for when
> we can make the page writable early - that's the whole point of the
> patch, after all - but my point here is that it's not at all obvious
> that those new cases are actually correct.

Yes, agreed, the MM_CP_DIRTY_ACCT bit is very subtle and not easy to grasp.
It's just that I don't have a good idea for making it better yet..
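
To spell out where that bit comes from: mprotect_fixup() derives it from
vma_wants_writenotify(), roughly like below (paraphrased from mm/mprotect.c,
so the exact lines may differ):

        /*
         * "dirty accountable" means the mapping tracks dirtying via
         * write-protect faults; once the pte is already dirty we have been
         * notified, so the write bit can be restored immediately.
         */
        dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
        vma_set_page_prot(vma);

        change_protection(vma, start, end, vma->vm_page_prot,
                          dirty_accountable ? MM_CP_DIRTY_ACCT : 0);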

> 
> MAYBE it's all correct. I'm not saying it's wrong. I'm just saying
> it's not _obvious_ that it's correct.
> 
> What about that page_count() test, for example: it has a comment, it
> looks obvious, but it's very different from what do_wp_page() does. So
> what happens if we have a page-out at the same time that turns that
> page into a swap cache page, and increments the page count? What about
> that race? Do we end up with a writable page that is shared with a
> swap cache entry? Is that ok? Why isn't it ok in the page fault case?

That looks fine to me: for the race to happen we must have checked
page_count==1 first and granted the write bit, and only then does
add_to_swap_cache() take its extra refcount (otherwise the count would already
have been >= 2 and we would have bailed out).  That also means the unmapping
of the page can only happen after that point.  If my understanding above is
correct, our newly installed pte will still be zapped safely soon afterwards,
though of course only after we release the pgtable lock in
change_pte_range().
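
In other words, the ordering I have in mind is roughly (just to illustrate the
race being discussed):

        CPU0 (change_pte_range)             CPU1 (pageout)
        -----------------------             --------------
        lock pgtable
        see page_count(page) == 1
        install writable pte
        unlock pgtable
                                            add_to_swap_cache()   (count > 1 now)
                                            try_to_unmap()        (takes the
                                                                   pgtable lock,
                                                                   zaps our pte)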

Thanks,

-- 
Peter Xu




