On 08/28/2013 12:24 AM, Andrew Morton wrote:
> On Mon, 19 Aug 2013 14:23:42 +0200 Vlastimil Babka <vbabka@xxxxxxx> wrote:
>
>> Currently munlock_vma_pages_range() calls follow_page_mask() to obtain each
>> struct page. This entails repeated full page table translations and page table
>> lock taken for each page separately.
>>
>> This patch attempts to avoid the costly follow_page_mask() where possible, by
>> iterating over ptes within single pmd under single page table lock. The first
>> pte is obtained by get_locked_pte() for non-THP page acquired by the initial
>> follow_page_mask(). The latter function is also used as a fallback in case
>> simple pte_present() and vm_normal_page() are not sufficient to obtain the
>> struct page.
>
> mm/mlock.c: In function 'munlock_vma_pages_range':
> mm/mlock.c:388: warning: 'pmd_end' may be used uninitialized in this function
>
> As far as I can tell, this is notabug, but I'm not at all confident in
> that - the protocol for locals `pte' and `pmd_end' is bizarre.

I agree with both points.

> The function is fantastically hard to follow and deserves to be dragged
> outside, shot repeatedly then burned.

Aww, poor function, and it's all my fault. Let's put it on a diet instead...

> Could you please, as a matter of
> some urgency, take a look at rewriting the entire thing so that it is
> less than completely insane?

This patch replaces the following patch in the mm tree:
mm-munlock-manual-pte-walk-in-fast-path-instead-of-follow_page_mask.patch

Changelog since V2:
o Split PTE walk to __munlock_pagevec_fill()
o __munlock_pagevec() does not reinitialize the pagevec anymore
o Use page_zone_id() for checking if pages are in the same zone (smaller
  overhead than page_zone())

The only small functional change is that previously, failing the pte walk
would fall back to follow_page_mask() while continuing with the same
partially filled pagevec. Now, the pagevec is munlocked immediately after
the pte walk fails.
This means that batching might sometimes be less effective, but the gained
simplicity should be worth it.

--->8---

From d73d5fd42d55ccdf8833ce8d244dba4e816b1ee0 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@xxxxxxx>
Date: Wed, 7 Aug 2013 10:43:31 +0200
Subject: [PATCH v3 7/7] mm: munlock: manual pte walk in fast path instead of follow_page_mask()

Currently munlock_vma_pages_range() calls follow_page_mask() to obtain each
individual struct page. This entails repeated full page table translations
and page table lock taken for each page separately.

This patch avoids the costly follow_page_mask() where possible, by iterating
over ptes within single pmd under single page table lock. The first pte is
obtained by get_locked_pte() for non-THP page acquired by the initial
follow_page_mask(). The rest of the on-stack pagevec for munlock is filled
up using pte walk as long as pte_present() and vm_normal_page() are
sufficient to obtain the struct page.

After this patch, a 14% speedup was measured for munlocking a 56GB large
memory area with THP disabled.

Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>
---
 include/linux/mm.h |  12 +++---
 mm/mlock.c         | 110 ++++++++++++++++++++++++++++++++++++++---------------
 2 files changed, 85 insertions(+), 37 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f022460..942333e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -637,12 +637,12 @@ static inline enum zone_type page_zonenum(const struct page *page)
 #endif
 
 /*
- * The identification function is only used by the buddy allocator for
- * determining if two pages could be buddies. We are not really
- * identifying a zone since we could be using a the section number
- * id if we have not node id available in page flags.
- * We guarantee only that it will return the same value for two
- * combinable pages in a zone.
+ * The identification function is mainly used by the buddy allocator for
+ * determining if two pages could be buddies. We are not really identifying
+ * the zone since we could be using the section number id if we do not have
+ * node id available in page flags.
+ * We only guarantee that it will return the same value for two combinable
+ * pages in a zone.
  */
 static inline int page_zone_id(struct page *page)
 {
diff --git a/mm/mlock.c b/mm/mlock.c
index c7b03c8..c944457 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -280,8 +280,7 @@ static void __putback_lru_fast(struct pagevec *pvec, int pgrescued)
  * The second phase finishes the munlock only for pages where isolation
  * succeeded.
  *
- * Note that the pagevec may be modified during the process and before
- * returning, pagevec_reinit() is called.
+ * Note that the pagevec may be modified during the process.
  */
 static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 {
@@ -356,8 +355,60 @@ skip_munlock:
 	 */
 	if (pagevec_count(&pvec_putback))
 		__putback_lru_fast(&pvec_putback, pgrescued);
+}
 
-	pagevec_reinit(pvec);
+/*
+ * Fill up pagevec for __munlock_pagevec using pte walk
+ *
+ * The function expects that the struct page corresponding to @start address is
+ * a non-THP page already pinned and in the @pvec, and that it belongs to @zone.
+ *
+ * The rest of @pvec is filled by subsequent pages within the same pmd and same
+ * zone, as long as the pte's are present and vm_normal_page() succeeds. These
+ * pages also get pinned.
+ *
+ * Returns the address of the next page that should be scanned. This equals
+ * @start + PAGE_SIZE when no page could be added by the pte walk.
+ */
+static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
+		struct vm_area_struct *vma, int zoneid, unsigned long start,
+		unsigned long end)
+{
+	pte_t *pte;
+	spinlock_t *ptl;
+
+	/*
+	 * Initialize pte walk starting at the already pinned page where we
+	 * are sure that there is a pte.
+	 */
+	pte = get_locked_pte(vma->vm_mm, start, &ptl);
+	end = min(end, pmd_addr_end(start, end));
+
+	/* The page next to the pinned page is the first we will try to get */
+	start += PAGE_SIZE;
+	while (start < end) {
+		struct page *page = NULL;
+		pte++;
+		if (pte_present(*pte))
+			page = vm_normal_page(vma, start, *pte);
+		/*
+		 * Break if page could not be obtained or the page's
+		 * node+zone does not match
+		 */
+		if (!page || page_zone_id(page) != zoneid)
+			break;
+
+		get_page(page);
+		/*
+		 * Increase the address that will be returned *before* the
+		 * eventual break due to pvec becoming full by adding the page
+		 */
+		start += PAGE_SIZE;
+		if (pagevec_add(pvec, page) == 0)
+			break;
+	}
+	pte_unmap_unlock(pte, ptl);
+	return start;
 }
 
 /*
@@ -381,17 +432,16 @@ skip_munlock:
 void munlock_vma_pages_range(struct vm_area_struct *vma,
 			     unsigned long start, unsigned long end)
 {
-	struct pagevec pvec;
-	struct zone *zone = NULL;
-
-	pagevec_init(&pvec, 0);
 	vma->vm_flags &= ~VM_LOCKED;
 
 	while (start < end) {
-		struct page *page;
+		struct page *page = NULL;
 		unsigned int page_mask, page_increm;
-		struct zone *pagezone;
+		struct pagevec pvec;
+		struct zone *zone;
+		int zoneid;
 
+		pagevec_init(&pvec, 0);
 		/*
 		 * Although FOLL_DUMP is intended for get_dump_page(),
 		 * it just so happens that its special treatment of the
@@ -400,22 +450,10 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
 		 * has sneaked into the range, we won't oops here: great).
 		 */
 		page = follow_page_mask(vma, start, FOLL_GET | FOLL_DUMP,
-				&page_mask);
+					&page_mask);
+
 		if (page && !IS_ERR(page)) {
-			pagezone = page_zone(page);
-			/* The whole pagevec must be in the same zone */
-			if (pagezone != zone) {
-				if (pagevec_count(&pvec))
-					__munlock_pagevec(&pvec, zone);
-				zone = pagezone;
-			}
 			if (PageTransHuge(page)) {
-				/*
-				 * THP pages are not handled by pagevec due
-				 * to their possible split (see below).
-				 */
-				if (pagevec_count(&pvec))
-					__munlock_pagevec(&pvec, zone);
 				lock_page(page);
 				/*
 				 * Any THP page found by follow_page_mask() may
@@ -428,21 +466,31 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
 				put_page(page); /* follow_page_mask() */
 			} else {
 				/*
-				 * Non-huge pages are handled in batches
-				 * via pagevec. The pin from
-				 * follow_page_mask() prevents them from
-				 * collapsing by THP.
+				 * Non-huge pages are handled in batches via
+				 * pagevec. The pin from follow_page_mask()
+				 * prevents them from collapsing by THP.
 				 */
-				if (pagevec_add(&pvec, page) == 0)
-					__munlock_pagevec(&pvec, zone);
+				pagevec_add(&pvec, page);
+				zone = page_zone(page);
+				zoneid = page_zone_id(page);
+
+				/*
+				 * Try to fill the rest of pagevec using fast
+				 * pte walk. This will also update start to
+				 * the next page to process. Then munlock the
+				 * pagevec.
+				 */
+				start = __munlock_pagevec_fill(&pvec, vma,
+						zoneid, start, end);
+				__munlock_pagevec(&pvec, zone);
+				goto next;
 			}
 		}
 		page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);
 		start += page_increm * PAGE_SIZE;
+next:
 		cond_resched();
 	}
-	if (pagevec_count(&pvec))
-		__munlock_pagevec(&pvec, zone);
 }
 
 /*
-- 
1.8.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx. For more info on Linux MM,
see: http://www.linux-mm.org/ .