+ mm-swap-make-swap-cluster-size-same-of-thp-size-on-x86_64.patch added to -mm tree

akpm@xxxxxxxxxxxxxxxxxxxx · Tue, 28 Mar 2017 15:14:03 -0700

The patch titled
     Subject: mm, swap: make swap cluster size same of THP size on x86_64
has been added to the -mm tree.  Its filename is
     mm-swap-make-swap-cluster-size-same-of-thp-size-on-x86_64.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-swap-make-swap-cluster-size-same-of-thp-size-on-x86_64.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-swap-make-swap-cluster-size-same-of-thp-size-on-x86_64.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/SubmitChecklist when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Huang Ying <ying.huang@xxxxxxxxx>
Subject: mm, swap: make swap cluster size same of THP size on x86_64

Patch series "THP swap: Delay splitting THP during swapping out", v7.

Recently, the performance of storage devices has improved so fast that
we cannot saturate the disk bandwidth with a single logical CPU when
doing page swap out, even on a high-end server machine, because the
performance of storage devices has improved faster than that of a
single logical CPU.  It seems that this trend will not change in the
near future.  On the other hand, THP is becoming more and more popular
because of increased memory size.  So it becomes necessary to optimize
THP swap performance.

The advantages of the THP swap support include:

- Batch the swap operations for the THP to reduce lock
  acquiring/releasing, including allocating/freeing the swap space,
  adding/deleting to/from the swap cache, and writing/reading the swap
  space, etc.  This will help improve the performance of the THP swap.

- The THP swap space read/write will be 2M sequential IO.  This is
  particularly helpful for swap reads, which are usually 4k random
  IO.  This will improve the performance of the THP swap too.

- It will help reduce memory fragmentation, especially when the THP is
  heavily used by the applications.  The 2M of contiguous pages will be
  freed up after the THP is swapped out.

- It will improve the THP utilization on systems with swap turned on,
  because khugepaged is quite slow to collapse normal pages into a
  THP.  If the THP is split during swap out, it will take quite a
  long time for the normal pages to collapse back into a THP after
  being swapped in.  High THP utilization also helps the efficiency
  of the page based memory management.

There are some concerns regarding THP swap in, mainly because the
possibly enlarged read/write IO size (for swap in/out) may put more
overhead on the storage device.  To deal with that, THP swap in should
be turned on only when necessary.  For example, it can be selected via
"always/never/madvise" logic, to be turned on globally, turned off
globally, or turned on only for VMAs with MADV_HUGEPAGE, etc.

This patchset is based on 03/17 head of mmotm/master.

This patchset is the first step for the THP swap support.  The plan is
to delay splitting the THP step by step, and finally to avoid splitting
the THP during swap out altogether and swap the THP out/in as a whole.

As the first step, in this patchset, the splitting of the huge page is
delayed from almost the first step of swapping out to after allocating
the swap space for the THP and adding the THP into the swap cache.
This will reduce lock acquiring/releasing for the locks used for the
swap cache management.

With the patchset, the swap out throughput improves by 14.9% (from about
3.77GB/s to about 4.34GB/s) in the vm-scalability swap-w-seq test case
with 8 processes.  The test is done on a Xeon E5 v3 system.  The swap
device used is a RAM simulated PMEM (persistent memory) device.  To
test sequential swap out, the test case creates 8 processes,
which sequentially allocate and write to anonymous pages until the
RAM and part of the swap device are used up.

The detailed comparison result is as follows:

base             base+patchset
---------------- --------------------------
         %stddev     %change         %stddev
             \          |                \
   7043990 ±  0%     +21.2%    8536807 ±  0%  vm-scalability.throughput
    109.94 ±  1%     -16.2%      92.09 ±  0%  vm-scalability.time.elapsed_time
   3957091 ±  0%     +14.9%    4547173 ±  0%  vmstat.swap.so
     31.46 ±  1%     -38.3%      19.42 ±  0%  perf-stat.cache-miss-rate%
      1.04 ±  1%     +22.2%       1.27 ±  0%  perf-stat.ipc
      9.33 ±  2%     -60.7%       3.67 ±  1%  perf-profile.calltrace.cycles-pp.add_to_swap.shrink_page_list.shrink_inactive_list.shrink_node_memcg.shrink_node



This patch (of 9):

In this patch, the size of the swap cluster is changed to that of the THP
(Transparent Huge Page) on the x86_64 architecture (512 pages).  This is
for the THP swap support on x86_64, where one swap cluster will be used to
hold the contents of each THP swapped out, and some information about the
swapped out THP (such as the compound map count) will be recorded in the
swap_cluster_info data structure.

For other architectures which want THP swap support,
ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for
the architecture.

In effect, this doubles the swap cluster size on x86_64 (from 256 to 512
pages), which may make it harder to find a free cluster when the swap
space becomes fragmented.  In theory, this may therefore reduce
contiguous swap space allocation and sequential writes.  The performance
tests in 0day show no regressions caused by this.

Link: http://lkml.kernel.org/r/20170328053209.25876-2-ying.huang@xxxxxxxxx
Suggested-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: "Huang, Ying" <ying.huang@xxxxxxxxx>
Cc: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: Shaohua Li <shli@xxxxxxxxxx>
Cc: Minchan Kim <minchan@xxxxxxxxxx>
Cc: Rik van Riel <riel@xxxxxxxxxx>
Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Cc: Ebru Akagunduz <ebru.akagunduz@xxxxxxxxx>
Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxxxx>
Cc: Tejun Heo <tj@xxxxxxxxxx>
Cc: Vladimir Davydov <vdavydov@xxxxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 arch/x86/Kconfig |    1 +
 mm/Kconfig       |   13 +++++++++++++
 mm/swapfile.c    |    4 ++++
 3 files changed, 18 insertions(+)

diff -puN arch/x86/Kconfig~mm-swap-make-swap-cluster-size-same-of-thp-size-on-x86_64 arch/x86/Kconfig

--- a/arch/x86/Kconfig~mm-swap-make-swap-cluster-size-same-of-thp-size-on-x86_64
+++ a/arch/x86/Kconfig
@@ -175,6 +175,7 @@ config X86
 	select USER_STACKTRACE_SUPPORT
 	select VIRT_TO_BUS
 	select X86_FEATURE_NAMES		if PROC_FS
+	select ARCH_USES_THP_SWAP_CLUSTER	if X86_64
 
 config INSTRUCTION_DECODER
 	def_bool y
diff -puN mm/Kconfig~mm-swap-make-swap-cluster-size-same-of-thp-size-on-x86_64 mm/Kconfig
--- a/mm/Kconfig~mm-swap-make-swap-cluster-size-same-of-thp-size-on-x86_64
+++ a/mm/Kconfig
@@ -499,6 +499,19 @@ config FRONTSWAP
 
 	  If unsure, say Y to enable frontswap.
 
+config ARCH_USES_THP_SWAP_CLUSTER
+	bool
+	default n
+
+config THP_SWAP_CLUSTER
+	bool
+	depends on SWAP && TRANSPARENT_HUGEPAGE && ARCH_USES_THP_SWAP_CLUSTER
+	default y
+	help
+	  Use one swap cluster to hold the contents of the THP
+	  (Transparent Huge Page) swapped out.  The size of the swap
+	  cluster will be same as that of THP.
+
 config CMA
 	bool "Contiguous Memory Allocator"
 	depends on HAVE_MEMBLOCK && MMU
diff -puN mm/swapfile.c~mm-swap-make-swap-cluster-size-same-of-thp-size-on-x86_64 mm/swapfile.c
--- a/mm/swapfile.c~mm-swap-make-swap-cluster-size-same-of-thp-size-on-x86_64
+++ a/mm/swapfile.c
@@ -199,7 +199,11 @@ static void discard_swap_cluster(struct
 	}
 }
 
+#ifdef CONFIG_THP_SWAP_CLUSTER
+#define SWAPFILE_CLUSTER	HPAGE_PMD_NR
+#else
 #define SWAPFILE_CLUSTER	256
+#endif
 #define LATENCY_LIMIT		256
 
 static inline void cluster_set_flag(struct swap_cluster_info *info,
_

Patches currently in -mm which might be from ying.huang@xxxxxxxxx are

mm-swap-fix-a-race-in-free_swap_and_cache.patch
mm-swap-fix-comment-in-__read_swap_cache_async.patch
mm-swap-improve-readability-via-make-spin_lock-unlock-balanced.patch
mm-swap-avoid-lock-swap_avail_lock-when-held-cluster-lock.patch
mm-swap-make-swap-cluster-size-same-of-thp-size-on-x86_64.patch
mm-memcg-support-to-charge-uncharge-multiple-swap-entries.patch
mm-thp-swap-add-swap-cluster-allocate-free-functions.patch
mm-thp-swap-add-get_huge_swap_page.patch
mm-thp-swap-support-to-clear-swap_has_cache-for-huge-page.patch
mm-thp-swap-support-to-add-delete-thp-to-from-swap-cache.patch
mm-thp-add-can_split_huge_page.patch
mm-thp-swap-support-to-split-thp-in-swap-cache.patch
mm-thp-swap-delay-splitting-thp-during-swap-out.patch
