Re: [PATCH mm-unstable v1] mm: don't check VMA write permissions if the PTE/PMD indicates write permissions

On Tue, Apr 18, 2023 at 11:56:07AM -0400, Peter Xu wrote:
> On Tue, Apr 18, 2023 at 04:21:13PM +0200, David Hildenbrand wrote:
> > Staring at the comment "Recheck VMA as permissions can change since
> > migration started" in remove_migration_pte() can result in confusion,
> > because if the source PTE/PMD indicates write permissions, then there
> > should be no need to check VMA write permissions when restoring migration
> > entries or PTE-mapping a PMD.
> > 
> > Commit d3cb8bf6081b ("mm: migrate: Close race between migration completion
> > and mprotect") introduced the maybe_mkwrite() handling in
> > remove_migration_pte() in 2014, stating that a race between mprotect() and
> > migration finishing would be possible, and that we could end up with
> > a writable PTE that should be readable.
> > 
> > However, mprotect() code first updates vma->vm_flags / vma->vm_page_prot
> > and then walks the page tables to (a) set all present writable PTEs to
> > read-only and (b) convert all writable migration entries to readable
> > migration entries. While walking the page tables and modifying the
> > entries, migration code has to grab the PT locks to synchronize against
> > concurrent page table modifications.
> 
> Makes sense to me.
> 
> > 
> > Assuming migration would find a writable migration entry (while holding
> > the PT lock) and replace it with a writable present PTE, surely mprotect()
> > code didn't stumble over the writable migration entry yet (converting it
> > into a readable migration entry) and would instead wait for the PT lock to
> > convert the now present writable PTE into a read-only PTE. As mprotect()
> > didn't finish yet, the behavior is just like migration didn't happen: a
> > writable PTE will be converted to a read-only PTE.
> > 
> > So it's fine to rely on the writability information in the source
> > PTE/PMD and not recheck against the VMA as long as we're holding the PT
> > lock to synchronize with anyone who concurrently wants to downgrade write
> > permissions (like mprotect()) by first adjusting vma->vm_flags /
> > vma->vm_page_prot to then walk over the page tables to adjust the page
> > table entries.
> > 
> > Running test cases that should reveal such races -- mprotect(PROT_READ)
> > racing with page migration or THP splitting -- for multiple hours did
> > not reveal an issue with this cleanup.
> > 
> > Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> > Cc: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> > Cc: Peter Xu <peterx@xxxxxxxxxx>
> > Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
> > ---
> > 
> > This is a follow-up cleanup to [1]:
> > 	[PATCH v1 RESEND 0/6] mm: (pte|pmd)_mkdirty() should not
> > 	unconditionally allow for write access
> > 
> > I wanted to be a bit careful and write some test cases to convince myself
> > that I am not missing something important. Of course, there is still the
> > possibility that my test cases are buggy ;)
> > 
> > Test cases I'm running:
> > 	https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/test_mprotect_migration.c
> > 	https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/test_mprotect_thp_split.c
> > 
> > 
> > [1] https://lkml.kernel.org/r/20230411142512.438404-1-david@xxxxxxxxxx
> > 
> > ---
> >  mm/huge_memory.c | 4 ++--
> >  mm/migrate.c     | 5 +----
> >  2 files changed, 3 insertions(+), 6 deletions(-)
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index c23fa39dec92..624671aaa60d 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2234,7 +2234,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >  		} else {
> >  			entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
> >  			if (write)
> > -				entry = maybe_mkwrite(entry, vma);
> > +				entry = pte_mkwrite(entry);
> 
> This is another change besides page migration.  I also don't know why it's
> needed, but it's there since day 1 of thp split in eef1b3ba053, so maybe
> worthwhile to copy Kirill too (which I did).

I was concentrating on correctness at that point, and this small
inefficiency didn't catch my eye.

I was curious how we serialize here against mprotect().

Looks safe to me:

	CPU0					CPU1

__split_huge_pmd()
  pmd_lock()
  __split_huge_pmd_locked()
    pmdp_invalidate()
    // PMD is non-present, but huge at this point
						change_protection()
						  change_pmd_range()
						    pmd_none_or_clear_bad_unless_trans_huge() == 0 // not skipped
						    change_huge_pmd()
						      __pmd_trans_huge_lock()
						        pmd_lock() // serialized against __split_huge_pmd()
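
The ordering argument above can be sketched as a toy user-space model.
This is an illustration only, not kernel code: struct toy_pmd, the toy_*
functions and TOY_NR_PTES are all made up for the example, and the model
assumes a single lock covers both the huge PMD and the PTEs created by the
split, so the two critical sections can only run one after the other.
Under that assumption, carrying the write bit over from the source PMD
without consulting the VMA leaves no writable PTE behind in either order.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_PTES 4	/* hypothetical; a real THP maps far more PTEs */

/* Toy stand-in for one PMD slot and the PTE table it may be split into. */
struct toy_pmd {
	bool huge;			/* still mapped as a huge PMD? */
	bool writable;			/* write bit of the huge PMD */
	bool pte_writable[TOY_NR_PTES];	/* write bits after the split */
};

/* Models the split: assumed to run entirely under the lock. */
static void toy_split_locked(struct toy_pmd *pmd)
{
	bool write = pmd->writable;	/* taken from the source PMD only */

	assert(pmd->huge);
	pmd->huge = false;
	for (int i = 0; i < TOY_NR_PTES; i++)
		pmd->pte_writable[i] = write;	/* no VMA recheck */
}

/* Models the mprotect(PROT_READ) walk: assumed to run under the same lock. */
static void toy_change_protection_locked(struct toy_pmd *pmd)
{
	if (pmd->huge) {
		pmd->writable = false;	/* downgrade the huge PMD */
		return;
	}
	for (int i = 0; i < TOY_NR_PTES; i++)
		pmd->pte_writable[i] = false;	/* downgrade each PTE */
}

/* Runs both critical sections in the given order; true if a PTE stays writable. */
bool toy_any_pte_writable(bool split_first)
{
	struct toy_pmd pmd = { .huge = true, .writable = true };

	if (split_first) {
		toy_split_locked(&pmd);
		toy_change_protection_locked(&pmd);
	} else {
		toy_change_protection_locked(&pmd);
		toy_split_locked(&pmd);
	}

	for (int i = 0; i < TOY_NR_PTES; i++)
		if (pmd.pte_writable[i])
			return true;
	return false;
}
```

Calling toy_any_pte_writable() with either argument returns false. The
model deliberately ignores everything the lock does not cover, such as TLB
flushing and the vma->vm_flags update that precedes the page table walk.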

Acked-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>

-- 
  Kiryl Shutsemau / Kirill A. Shutemov



