Re: [patch 128/192] mm: improve mprotect(R|W) efficiency on pages referenced once

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Wed, 30 Jun 2021 11:03:25 -0700

On Wed, Jun 30, 2021 at 9:42 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> Yes I still fully agree it's very un-obvious.  So far the best thing I can come
> up with is something like below (patch attached too but not yet tested). I
> moved VM_WRITE out so hopefully it'll be very clear; then I also rearranged the
> checks so the final outcome looks like below:
>
> static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
>                                   unsigned long cp_flags)
> {
>         /*
>          * It is unclear whether this optimization can be done safely for NUMA
>          * pages.
>          */
>         if (cp_flags & MM_CP_PROT_NUMA)
>                 return false;

Please just put that VM_WRITE test first. It's the one that really
*matters*. There's no "it's unclear if" about that part. Just handle
the obvious and important check first.

Yeah, yeah, they both return false, so order doesn't matter from a
semantic standpoint, but from a clarity standpoint just do the clear
and unambiguous and security-relevant test first.

The rest of the tests are implementation details, the VM_WRITE test is
fundamental behavior. It's the one that made me worry about this patch
in the first place.

>         /*
>          * Don't do this optimization for clean pages as we need to be notified
>          * of the transition from clean to dirty.
>          */
>         if (!pte_dirty(pte))
>                 return false;
>
>         /* Same for softdirty. */
>         if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
>                 return false;
>
>         /*
>          * For userfaultfd the user program needs to monitor write faults so we
>          * can't do this optimization.
>          */
>         if (pte_uffd_wp(pte))
>                 return false;

So all of these are a bit special.

Why? Because if I look at the actual page fault path, these are not
the tests there.

I'd really like to have some obvious situation where we keep this
"make it writable" in sync with what would actually happen on a write
fault when it's not writable.

And it's not at all obvious to me for these cases.

The do_wp_page() code doesn't even use pte_uffd_wp(). It uses
userfaultfd_pte_wp(vma, pte), and I don't even know why. Yes, I can
see the code (it additionally tests the VM_UFFD_WP flag in the vma),
but a number of other paths then only do that pte_uffd_wp() test.

I get the feeling that we really should try to match what the
do_wp_page() path does, though.

Which brings up another issue: the do_wp_page() path treats PageKsm()
pages differently. And it locks the page before double-checking the
page count.

Why does mprotect() not need to do the same thing? I think this has
come up before, and "change_protection()" can get called with the
mmap_sem held just for reading - see userfaultfd - so it has all the
same issues as a page fault does, afaik.

>         /*
>          * MM_CP_DIRTY_ACCT indicates that we can always make the page writable
>          * regardless of the number of references.  Time to set the write bit.
>          */
>         if (cp_flags & MM_CP_DIRTY_ACCT)
>                 return true;
>
>         /*
>          * Othewise it means !MM_CP_DIRTY_ACCT.  We can only apply write bit
>          * early if it's anonymous page and we exclusively own it.
>          */
>         if (vma_is_anonymous(vma) && (page_count(pte_page(pte)) == 1))
>                 return true;
>
>         /* Don't play any trick */
>         return false;
> }
>
> The logic should be the same as before, it's just that we'll do an extra check
> on VM_WRITE for MM_CP_DIRTY_ACCT but assuming it's ok.

See above. I don't think the logic before was all that clear either.

The one case that is clear is that if it's a shared mapping, and
MM_CP_DIRTY_ACCT is set, and it was already dirty (and softdirty),
then it's ok.,

That's the old code.  I don't like how the old code was written
(because I think that MM_CP_DIRTY_ACCT bit wasx too subtle), but I
think the old code was at least correct.

The new code, it just worries me. It adds all those new cases for when
we can make the page writable early - that's the whole point of the
patch, after all - but my point here is that it's not at all obvious
that those new cases are actually correct.

MAYBE it's all correct. I'm not saying it's wrong. I'm just saying
it's not _obvious_ that it's correct.

What about that page_count() test, for example: it has a comment, it
looks obvious, but it's very different from what do_wp_page() does. So
what happens if we have a page-out at the same time that turns that
page into a swap cache page, and increments the page count? What about
that race? Do we end up with a writable page that is shared with a
swap cache entry? Is that ok? Why isn't it ok in the page fault case?

See why this patch worries me so much?

                       Linus