The patch titled
     Subject: mm: use mm_zero_struct_page from SPARC on all 64b architectures
has been added to the -mm tree.  Its filename is
     mm-use-mm_zero_struct_page-from-sparc-on-all-64b-architectures.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-use-mm_zero_struct_page-from-sparc-on-all-64b-architectures.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-use-mm_zero_struct_page-from-sparc-on-all-64b-architectures.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Alexander Duyck <alexander.h.duyck@xxxxxxxxxxxxxxx>
Subject: mm: use mm_zero_struct_page from SPARC on all 64b architectures

Patch series "Deferred page init improvements", v5.

This patchset is essentially a refactor of the page initialization logic
that is meant to provide for better code reuse while providing a
significant improvement in deferred page initialization performance.

In my testing, on an x86_64 system with 384GB of RAM and 3TB of persistent
memory per node, I have seen the following.  In the case of regular memory
initialization the deferred init time was decreased from 3.75s to 1.06s on
average.  For the persistent memory the initialization time dropped from
24.17s to 19.12s on average.  This amounts to a 253% improvement in
deferred memory initialization performance, and a 26% improvement in
persistent memory initialization performance.

I have called out the improvement observed with each patch.

This patch (of 7):

This change makes it so that we use the same approach that was already in
use on Sparc on all the architectures that support a 64b long.

This is mostly motivated by the fact that 7 to 10 store/move instructions
are likely always going to be faster than having to call into a function
that is not specialized for handling page init.

An added advantage to doing it this way is that the compiler can get away
with combining writes in the __init_single_page call.  As a result the
memset call will be reduced to only about 4 write operations, or at least
that is what I am seeing with GCC 6.2, as the flags, LRU pointers, and
count/mapcount seem to be cancelling out at least 4 of the 8 assignments
on my system.

One change I had to make to the function was to reduce the minimum page
size to 56 to support some powerpc64 configurations.

This change should introduce no change on SPARC since it already had this
code.  In the case of x86_64 I saw a reduction from 3.75s to 2.80s when
initializing 384GB of RAM per node.

Pavel Tatashin tested on a system with Broadcom's Stingray CPU and 48GB of
RAM and found that __init_single_page() takes 19.30ns per 64-byte struct
page before this patch, and 17.33ns per 64-byte struct page with it.

Mike Rapoport ran a similar test on an OpenPower (S812LC 8348-21C) machine
with a Power8 processor and 128GB of RAM.  His results per 64-byte struct
page were 4.68ns before, and 4.59ns after this patch.
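
To make the write-combining point above concrete, below is a rough,
abridged sketch of the call site in question, __init_single_page() in
mm/page_alloc.c.  It is reconstructed from memory rather than quoted from
the tree, so the exact helpers and their order may differ slightly:

  /* abridged sketch, not verbatim kernel source */
  static void __meminit __init_single_page(struct page *page, unsigned long pfn,
                                           unsigned long zone, int nid)
  {
          mm_zero_struct_page(page);              /* now 7-10 plain inline stores */
          set_page_links(page, zone, nid, pfn);   /* stores into page->flags */
          init_page_count(page);                  /* stores into page->_refcount */
          page_mapcount_reset(page);              /* stores into page->_mapcount */
          page_cpupid_reset_last(page);

          INIT_LIST_HEAD(&page->lru);             /* stores into both LRU pointers */
  }

Because the zeroing stores are now visible to the compiler at this point,
it can drop the ones that are immediately overwritten by the field
initializers that follow, which is where the "about 4 write operations"
figure above comes from.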
Link: http://lkml.kernel.org/r/154145277052.30046.9861512989491054994.stgit@xxxxxxxxxxxxxxxxxxxxxxxxxx
Signed-off-by: Alexander Duyck <alexander.h.duyck@xxxxxxxxxxxxxxx>
Reviewed-by: Pavel Tatashin <pavel.tatashin@xxxxxxxxxxxxx>
Acked-by: Michal Hocko <mhocko@xxxxxxxx>
Cc: David S. Miller <davem@xxxxxxxxxxxxx>
Cc: Pavel Tatashin <pavel.tatashin@xxxxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
Cc: Dave Jiang <dave.jiang@xxxxxxxxx>
Cc: Mike Rapoport <rppt@xxxxxxxxxxxxxxxxxx>
Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
Cc: Vlastimil Babka <vbabka@xxxxxxx>
Cc: Khalid Aziz <khalid.aziz@xxxxxxxxxx>
Cc: Laurent Dufour <ldufour@xxxxxxxxxxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
Cc: Oscar Salvador <osalvador@xxxxxxxxxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 arch/sparc/include/asm/pgtable_64.h |   30 ------------------
 include/linux/mm.h                  |   41 ++++++++++++++++++++++++--
 2 files changed, 38 insertions(+), 33 deletions(-)

--- a/arch/sparc/include/asm/pgtable_64.h~mm-use-mm_zero_struct_page-from-sparc-on-all-64b-architectures
+++ a/arch/sparc/include/asm/pgtable_64.h
@@ -231,36 +231,6 @@ extern unsigned long _PAGE_ALL_SZ_BITS;
 extern struct page *mem_map_zero;
 #define ZERO_PAGE(vaddr)	(mem_map_zero)
 
-/* This macro must be updated when the size of struct page grows above 80
- * or reduces below 64.
- * The idea that compiler optimizes out switch() statement, and only
- * leaves clrx instructions
- */
-#define	mm_zero_struct_page(pp) do {		\
-	unsigned long *_pp = (void *)(pp);	\
-						\
-	/* Check that struct page is either 64, 72, or 80 bytes */	\
-	BUILD_BUG_ON(sizeof(struct page) & 7);	\
-	BUILD_BUG_ON(sizeof(struct page) < 64);	\
-	BUILD_BUG_ON(sizeof(struct page) > 80);	\
-						\
-	switch (sizeof(struct page)) {		\
-	case 80:				\
-		_pp[9] = 0;	/* fallthrough */	\
-	case 72:				\
-		_pp[8] = 0;	/* fallthrough */	\
-	default:				\
-		_pp[7] = 0;			\
-		_pp[6] = 0;			\
-		_pp[5] = 0;			\
-		_pp[4] = 0;			\
-		_pp[3] = 0;			\
-		_pp[2] = 0;			\
-		_pp[1] = 0;			\
-		_pp[0] = 0;			\
-	}					\
-} while (0)
-
 /* PFNs are real physical page numbers.  However, mem_map only begins to record
  * per-page information starting at pfn_base.  This is to handle systems where
  * the first physical page in the machine is at some huge physical address,
--- a/include/linux/mm.h~mm-use-mm_zero_struct_page-from-sparc-on-all-64b-architectures
+++ a/include/linux/mm.h
@@ -98,10 +98,45 @@ extern int mmap_rnd_compat_bits __read_m
 
 /*
  * On some architectures it is expensive to call memset() for small sizes.
- * Those architectures should provide their own implementation of "struct page"
- * zeroing by defining this macro in <asm/pgtable.h>.
+ * If an architecture decides to implement their own version of
+ * mm_zero_struct_page they should wrap the defines below in a #ifndef and
+ * define their own version of this macro in <asm/pgtable.h>
  */
-#ifndef mm_zero_struct_page
+#if BITS_PER_LONG == 64
+/* This function must be updated when the size of struct page grows above 80
+ * or reduces below 56.  The idea is that the compiler optimizes out the
+ * switch() statement and only leaves move/store instructions.  Also the
+ * compiler can combine write statements if they are both assignments and can
+ * be reordered; this can result in several of the writes here being dropped.
+ */
+#define	mm_zero_struct_page(pp) __mm_zero_struct_page(pp)
+static inline void __mm_zero_struct_page(struct page *page)
+{
+	unsigned long *_pp = (void *)page;
+
+	/* Check that struct page is either 56, 64, 72, or 80 bytes */
+	BUILD_BUG_ON(sizeof(struct page) & 7);
+	BUILD_BUG_ON(sizeof(struct page) < 56);
+	BUILD_BUG_ON(sizeof(struct page) > 80);
+
+	switch (sizeof(struct page)) {
+	case 80:
+		_pp[9] = 0;	/* fallthrough */
+	case 72:
+		_pp[8] = 0;	/* fallthrough */
+	case 64:
+		_pp[7] = 0;	/* fallthrough */
+	case 56:
+		_pp[6] = 0;
+		_pp[5] = 0;
+		_pp[4] = 0;
+		_pp[3] = 0;
+		_pp[2] = 0;
+		_pp[1] = 0;
+		_pp[0] = 0;
+	}
+}
+#else
 #define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
 #endif
 
_

Patches currently in -mm which might be from alexander.h.duyck@xxxxxxxxxxxxxxx are

mm-use-mm_zero_struct_page-from-sparc-on-all-64b-architectures.patch
mm-drop-meminit_pfn_in_nid-as-it-is-redundant.patch
mm-implement-new-zone-specific-memblock-iterator.patch
mm-initialize-max_order_nr_pages-at-a-time-instead-of-doing-larger-sections.patch
mm-move-hot-plug-specific-memory-init-into-separate-functions-and-optimize.patch
mm-add-reserved-flag-setting-to-set_page_links.patch
mm-use-common-iterator-for-deferred_init_pages-and-deferred_free_pages.patch