[merged] mm-munlock-fix-potential-race-with-thp-page-split.patch removed from -mm tree

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Subject: [merged] mm-munlock-fix-potential-race-with-thp-page-split.patch removed from -mm tree
To: vbabka@xxxxxxx,aarcange@xxxxxxxxxx,hughd@xxxxxxxxxx,mgorman@xxxxxxx,riel@xxxxxxxxxx,sasha.levin@xxxxxxxxxx,stable@xxxxxxxxxxxxxxx,walken@xxxxxxxxxx,mm-commits@xxxxxxxxxxxxxxx
From: akpm@xxxxxxxxxxxxxxxxxxxx
Date: Fri, 24 Jan 2014 10:58:27 -0800


The patch titled
     Subject: mm: munlock: fix potential race with THP page split
has been removed from the -mm tree.  Its filename was
     mm-munlock-fix-potential-race-with-thp-page-split.patch

This patch was dropped because it was merged into mainline or a subsystem tree

------------------------------------------------------
From: Vlastimil Babka <vbabka@xxxxxxx>
Subject: mm: munlock: fix potential race with THP page split

Since commit ff6a6da60 ("mm: accelerate munlock() treatment of THP pages")
munlock skips tail pages of a munlocked THP page.  There is some attempt
to prevent bad consequences of racing with a THP page split, but code
inspection indicates that there are two problems that may lead to a
non-fatal, yet wrong outcome.

First, __split_huge_page_refcount() copies flags including PageMlocked
from the head page to the tail pages.  Clearing PageMlocked by
munlock_vma_page() in the middle of this operation might result in part of
tail pages left with PageMlocked flag.  As the head page still appears to
be a THP page until all tail pages are processed, munlock_vma_page() might
think it munlocked the whole THP page and skip all the former tail pages. 
Before ff6a6da60, those pages would be cleared in further iterations of
munlock_vma_pages_range(), but NR_MLOCK would still become undercounted
(related the next point).

Second, NR_MLOCK accounting is based on call to hpage_nr_pages() after the
PageMlocked is cleared.  The accounting might also become inconsistent due
to race with __split_huge_page_refcount()

- undercount when HUGE_PMD_NR is subtracted, but some tail pages are
  left with PageMlocked set and counted again (only possible before
  ff6a6da60)

- overcount when hpage_nr_pages() sees a normal page (split has already
  finished), but the parallel split has meanwhile cleared PageMlocked from
  additional tail pages

This patch prevents both problems via extending the scope of lru_lock in
munlock_vma_page().  This is convenient because:

- __split_huge_page_refcount() takes lru_lock for its whole operation

- munlock_vma_page() typically takes lru_lock anyway for page isolation

As this becomes a second function where page isolation is done with
lru_lock already held, factor this out to a new
__munlock_isolate_lru_page() function and clean up the code around.

[akpm@xxxxxxxxxxxxxxxxxxxx: avoid a coding-style ugly]
Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>
Cc: Sasha Levin <sasha.levin@xxxxxxxxxx>
Cc: Michel Lespinasse <walken@xxxxxxxxxx>
Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Cc: Rik van Riel <riel@xxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: <stable@xxxxxxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/mlock.c |  104 +++++++++++++++++++++++++++++----------------------
 1 file changed, 60 insertions(+), 44 deletions(-)

diff -puN mm/mlock.c~mm-munlock-fix-potential-race-with-thp-page-split mm/mlock.c
--- a/mm/mlock.c~mm-munlock-fix-potential-race-with-thp-page-split
+++ a/mm/mlock.c
@@ -91,6 +91,26 @@ void mlock_vma_page(struct page *page)
 }
 
 /*
+ * Isolate a page from LRU with optional get_page() pin.
+ * Assumes lru_lock already held and page already pinned.
+ */
+static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
+{
+	if (PageLRU(page)) {
+		struct lruvec *lruvec;
+
+		lruvec = mem_cgroup_page_lruvec(page, page_zone(page));
+		if (getpage)
+			get_page(page);
+		ClearPageLRU(page);
+		del_page_from_lru_list(page, lruvec, page_lru(page));
+		return true;
+	}
+
+	return false;
+}
+
+/*
  * Finish munlock after successful page isolation
  *
  * Page must be locked. This is a wrapper for try_to_munlock()
@@ -126,9 +146,9 @@ static void __munlock_isolated_page(stru
 static void __munlock_isolation_failed(struct page *page)
 {
 	if (PageUnevictable(page))
-		count_vm_event(UNEVICTABLE_PGSTRANDED);
+		__count_vm_event(UNEVICTABLE_PGSTRANDED);
 	else
-		count_vm_event(UNEVICTABLE_PGMUNLOCKED);
+		__count_vm_event(UNEVICTABLE_PGMUNLOCKED);
 }
 
 /**
@@ -152,28 +172,34 @@ static void __munlock_isolation_failed(s
 unsigned int munlock_vma_page(struct page *page)
 {
 	unsigned int nr_pages;
+	struct zone *zone = page_zone(page);
 
 	BUG_ON(!PageLocked(page));
 
-	if (TestClearPageMlocked(page)) {
-		nr_pages = hpage_nr_pages(page);
-		mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
-		if (!isolate_lru_page(page))
-			__munlock_isolated_page(page);
-		else
-			__munlock_isolation_failed(page);
-	} else {
-		nr_pages = hpage_nr_pages(page);
-	}
-
 	/*
-	 * Regardless of the original PageMlocked flag, we determine nr_pages
-	 * after touching the flag. This leaves a possible race with a THP page
-	 * split, such that a whole THP page was munlocked, but nr_pages == 1.
-	 * Returning a smaller mask due to that is OK, the worst that can
-	 * happen is subsequent useless scanning of the former tail pages.
-	 * The NR_MLOCK accounting can however become broken.
+	 * Serialize with any parallel __split_huge_page_refcount() which
+	 * might otherwise copy PageMlocked to part of the tail pages before
+	 * we clear it in the head page. It also stabilizes hpage_nr_pages().
 	 */
+	spin_lock_irq(&zone->lru_lock);
+
+	nr_pages = hpage_nr_pages(page);
+	if (!TestClearPageMlocked(page))
+		goto unlock_out;
+
+	__mod_zone_page_state(zone, NR_MLOCK, -nr_pages);
+
+	if (__munlock_isolate_lru_page(page, true)) {
+		spin_unlock_irq(&zone->lru_lock);
+		__munlock_isolated_page(page);
+		goto out;
+	}
+	__munlock_isolation_failed(page);
+
+unlock_out:
+	spin_unlock_irq(&zone->lru_lock);
+
+out:
 	return nr_pages - 1;
 }
 
@@ -310,34 +336,24 @@ static void __munlock_pagevec(struct pag
 		struct page *page = pvec->pages[i];
 
 		if (TestClearPageMlocked(page)) {
-			struct lruvec *lruvec;
-			int lru;
-
-			if (PageLRU(page)) {
-				lruvec = mem_cgroup_page_lruvec(page, zone);
-				lru = page_lru(page);
-				/*
-				 * We already have pin from follow_page_mask()
-				 * so we can spare the get_page() here.
-				 */
-				ClearPageLRU(page);
-				del_page_from_lru_list(page, lruvec, lru);
-			} else {
-				__munlock_isolation_failed(page);
-				goto skip_munlock;
-			}
-
-		} else {
-skip_munlock:
 			/*
-			 * We won't be munlocking this page in the next phase
-			 * but we still need to release the follow_page_mask()
-			 * pin. We cannot do it under lru_lock however. If it's
-			 * the last pin, __page_cache_release would deadlock.
+			 * We already have pin from follow_page_mask()
+			 * so we can spare the get_page() here.
 			 */
-			pagevec_add(&pvec_putback, pvec->pages[i]);
-			pvec->pages[i] = NULL;
+			if (__munlock_isolate_lru_page(page, false))
+				continue;
+			else
+				__munlock_isolation_failed(page);
 		}
+
+		/*
+		 * We won't be munlocking this page in the next phase
+		 * but we still need to release the follow_page_mask()
+		 * pin. We cannot do it under lru_lock however. If it's
+		 * the last pin, __page_cache_release() would deadlock.
+		 */
+		pagevec_add(&pvec_putback, pvec->pages[i]);
+		pvec->pages[i] = NULL;
 	}
 	delta_munlocked = -nr + pagevec_count(&pvec_putback);
 	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
_

Patches currently in -mm which might be from vbabka@xxxxxxx are

origin.patch
mm-remove-bug_on-from-mlock_vma_page.patch

--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Kernel]     [Kernel Development Newbies]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite Hiking]     [Linux Kernel]     [Linux SCSI]