Re: [patch 128/192] mm: improve mprotect(R|W) efficiency on pages referenced once

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Thu, 1 Jul 2021 11:29:50 -0700

On Wed, Jun 30, 2021 at 6:27 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Wed, Jun 30, 2021 at 11:03:25AM -0700, Linus Torvalds wrote:
> >
> > What about that page_count() test, for example: it has a comment, it
> > looks obvious, but it's very different from what do_wp_page() does. So
> > what happens if we have a page-out at the same time that turns that
> > page into a swap cache page, and increments the page count? What about
> > that race? Do we end up with a writable page that is shared with a
> > swap cache entry? Is that ok? Why isn't it ok in the page fault case?
>
> That looks fine to me: when the race happens we must have checked page_count==1
> first and granted the write bit, then add_to_swap_cache() happens after the
> page_count==1 check (as it takes another refcount, so >2 otherwise).  Then it
> also means unmap mappings should happen even after that point.  If my above
> understanding is correct, our newly installed pte will be zapped safely soon,
> but of course after we release the pgtable lock in change_pte_range().

So if this is fine, then maybe we should just remove the page lock in
the do_wp_page() path (and remove the PageKSM check while at it)?

If it's not required by mprotect() to say "I can make the page
writable directly", then it really shouldn't be required by the page
fault path either.

Which I'd love to do, and was really itching to do (it's a nasty
lock), but I worried about it..

I'd hate to have mprotect do one thing, and page faulting do another
thing, and not have some logic to why they have to be different.

                  Linus