Re: [PATCH 2/2] mm: vma_merge: fix race vm_page_prot race condition against rmap_walk

Andrea Arcangeli <aarcange@xxxxxxxxxx> · Fri, 16 Sep 2016 22:54:41 +0200

On Fri, Sep 16, 2016 at 11:42:34AM -0700, Hugh Dickins wrote:
> >  	if (next && !insert) {
> > -		struct vm_area_struct *exporter = NULL, *importer = NULL;
> > +		struct vm_area_struct *exporter = NULL;
> >  
> >  		if (end >= next->vm_end) {
> >  			/*
> > @@ -729,6 +730,17 @@ again:
> >  			vma_interval_tree_remove(next, root);
> >  	}
> >  
> > +	if (importer == vma) {
> > +		/*
> > +		 * vm_page_prot and vm_flags can be read by the
> > +		 * rmap_walk, for example in
> > +		 * remove_migration_ptes(). Before releasing the rmap
> > +		 * locks the current vma must match the next that we
> > +		 * merged with for those fields.
> > +		 */
> > +		importer->vm_page_prot = next->vm_page_prot;
> > +		importer->vm_flags = next->vm_flags;
> > +	}
> 
> To take a concrete example for my doubt, "importer == vma" includes
> case 5 (see the diagrams above vma_merge()), but this then copies
> protections and flags from N to P ("P" being "vma" here), doesn't it?
> 
> Which would not be right, unless I'm confused - which is also very
> much possible.

I start to think it may be cleaner to pass the "vma" to copy the
protections from, as parameter of vma_adjust. I'll try without first,
but it'll add up to the knowledge of the caller details. At least this
code isn't changing often.

> For the moment I'm throwing this back to you without thinking more
> carefully about it, and assuming that either you'll come back with
> a new patch, or will point out my confusion.  But if you'd prefer
> me to take it over, do say so - you have the advantage of youth,
> I have the advantage of having written this code a long time ago,
> I'm not sure which of us is ahead :)

Up to you, we've been playing with this for days (most time spent
actually on NUMA/THP code until I figured it was something completely
different than initially though) so I've no problem in updating this
quick and testing it.

> Is it perhaps just case 8 (see "Odd one out" comment) that's a problem?

Case 8 is the one triggering the bug yes. And that got fixed 100% by
the patch. Unfortunately I agree I'm shifting the issue to case 5 and
so an update of this patch is in order...

> But I'm also worried about whether we shall need to try harder in the
> "remove_next == 2" case (I think that's case 6 in the diagrams): where
> for a moment vma_adjust() drops the locks with overlapping vmas in the
> tree, which the "goto again" goes back to clean up.  I don't know if
> the interval tree gives any guarantee of which of those overlapping
> vmas would be found first in lookup, nor whether it could actually
> trigger a problem with page migration.  But I suspect that for safety
> we shall need to enforce the same protections on the next->next which
> will be removed a moment later.  Ah, no, it must *already* have the
> same protections, if it's about to be merged out of existence: so I
> think I was making up a problem where there is none, but please check.

I think you're right the temporary release of the locks would be a
problem, it's a corollary of case 5 issue above.

I attempted a quick update for further review (beware,
untested). Tomorrow I'll test it and let you know how it goes after
thinking more about it...

>From 4460351fffa25f0b939ef4cf529de5f2d8b5ca78 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Date: Thu, 15 Sep 2016 17:20:10 +0200
Subject: [PATCH 1/1] mm: vma_merge: fix race vm_page_prot race condition
 against rmap_walk

The rmap_walk can access vm_page_prot (and potentially vm_flags in the
pte/pmd manipulations). So it's not safe to wait the caller to update
the vm_page_prot/vm_flags after vma_merge returned potentially
removing the "next" vma and extending the "current" vma over the
next->vm_start,vm_end range, but still with the "current" vma
vm_page_prot, after releasing the rmap locks.

The vm_page_prot/vm_flags must be transferred from the "next" vma to
the current vma while vma_merge still holds the rmap locks.

The side effect of this race condition is pte corruption during
migrate as remove_migration_ptes when run on a address of the "next"
vma that got removed, used the vm_page_prot of the current vma.

migrate	     	      	        mprotect
------------			-------------
migrating in "next" vma
				vma_merge() # removes "next" vma and
			        	    # extends "current" vma
					    # current vma is not with
					    # vm_page_prot updated
remove_migration_ptes
read vm_page_prot of current "vma"
establish pte with wrong permissions
				vm_set_page_prot(vma) # too late!
				change_protection in the old vma range
				only, next range is not updated

This caused segmentation faults and potentially memory corruption in
heavy mprotect loads with some light page migration caused by
compaction in the background.

v2: limit the scope to case 8 after review from Hugh Dickins.

Reported-by: Aditya Mandaleeka <adityam@xxxxxxxxxxxxx>
Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
---
 mm/mmap.c | 40 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 37 insertions(+), 3 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index f86fd39..fb4dec8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -727,6 +727,25 @@ again:
 			vma_interval_tree_remove(next, root);
 	}
 
+	if (remove_next == 1) {
+		/*
+		 * vm_page_prot and vm_flags can be read by the
+		 * rmap_walk, for example in remove_migration_ptes(),
+		 * so before releasing the rmap locks the permissions
+		 * of the expanded vmas must be already the correct
+		 * one for the whole merged range.
+		 *
+		 * mprotect case 8 (which sets remove_next == 1) needs
+		 * special handling to provide the above guarantee, as
+		 * it is the only case where the "vma" that is being
+		 * expanded is the one with the wrong permissions for
+		 * the whole merged region. So copy the right
+		 * permissions from the next one that is getting
+		 * removed before releasing the rmap locks.
+		 */
+		vma->vm_page_prot = next->vm_page_prot;
+		vma->vm_flags = next->vm_flags;
+	}
 	if (start != vma->vm_start) {
 		vma->vm_start = start;
 		start_changed = true;
@@ -807,7 +826,16 @@ again:
 		 */
 		next = vma->vm_next;
 		if (remove_next == 2) {
-			remove_next = 1;
+			/*
+			 * No need to transfer vm_page_prot/vm_flags
+			 * in the remove_next == 2 case,
+			 * vma_page_prot/vm_flags of the "vma" was
+			 * already the correct one for the whole range
+			 * in mprotect case 6. So set remove_next to 3
+			 * to skip that. It wouldn't hurt to execute
+			 * it but it's superflous.
+			 */
+			remove_next = 3;
 			end = next->vm_end;
 			goto again;
 		}
@@ -939,8 +967,14 @@ can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
  *    PPPP    NNNN    PPPPPPPPPPPP    PPPPPPPPNNNN    PPPPNNNNNNNN
  *    might become    case 1 below    case 2 below    case 3 below
  *
- * Odd one out? Case 8, because it extends NNNN but needs flags of XXXX:
- * mprotect_fixup updates vm_flags & vm_page_prot on successful return.
+ * Odd one out? Case 8, because it extends NNNN but needs the
+ * properties of XXXX. In turn the vma_merge caller must update the
+ * properties of successful return of vma_merge. An update in the
+ * caller of those properties is only ok if those properties are never
+ * accessed through rmap_walks (i.e. without the mmap_sem). In turn
+ * the vm_page_prot/vm_flags (which may be accessed by rmap_walks)
+ * must be transferred from XXXX to NNNN in case 8 before releasing
+ * the rmap locks.
  */
 struct vm_area_struct *vma_merge(struct mm_struct *mm,
 			struct vm_area_struct *prev, unsigned long addr,

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>