Re: [PATCH 1/1] mm: thp: Set THP defrag by default to madvise and add a stall-free defrag option

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 02/26/2016 05:15 PM, Mel Gorman wrote:
Changelog since v1
o Default defrag to madvise instead of never
o Introduce "defer" defrag option to wake kswapd/kcompact if THP is unavailable
o Restore "always" to historical behaviour
o Update documentation

THP defrag is enabled by default to direct reclaim/compact but not wake
kswapd in the event of a THP allocation failure. The problem is that THP
allocation requests potentially enter reclaim/compaction. This potentially
incurs a severe stall that is not guaranteed to be offset by reduced TLB
misses. While there has been considerable effort to reduce the impact
of reclaim/compaction, it is still a high cost and workloads that should
fit in memory fail to do so. Specifically, a simple anon/file streaming
workload will enter direct reclaim on NUMA at least even though the working
set size is 80% of RAM. It's been years and it's time to throw in the towel.

First, this patch defines THP defrag as follows;

madvise: A failed allocation will direct reclaim/compact if the application requests it
never:   Neither reclaim/compact nor wake kswapd
defer:   A failed allocation will wake kswapd/kcompactd
always:  A failed allocation will direct reclaim/compact (historical behaviour)
khugepaged defrag will enter direct/reclaim but not wake kswapd.

Next it sets the default defrag option to be "madvise" to only enter direct
reclaim/compaction for applications that specifically requested it.

Lastly, it removes a check from the page allocator slowpath that is related
to __GFP_THISNODE to allow "defer" to work. The callers that really cares are
slub/slab and they are updated accordingly. The slab one may be surprising
because it also corrects a comment as kswapd was never woken up by that path.

It would be also nice if we could remove the is_thp_gfp_mask() checks one day. They try to make direct reclaim/compaction for THP less intrusive, but maybe could be removed now that the stalls are limited otherwise. But that's out of scope here.

[...]

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 8a282687ee06..a19b173cbc57 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -113,9 +113,26 @@ guaranteed, but it may be more likely in case the allocation is for a
  MADV_HUGEPAGE region.

  echo always >/sys/kernel/mm/transparent_hugepage/defrag
+echo defer >/sys/kernel/mm/transparent_hugepage/defrag
  echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
  echo never >/sys/kernel/mm/transparent_hugepage/defrag

+"always" means that an application requesting THP will stall on allocation
+failure and directly reclaim pages and compact memory in an effort to
+allocate a THP immediately. This may be desirable for virtual machines
+that benefit heavily from THP use and are willing to delay the VM start
+to utilise them.
+
+"defer" means that an application will wake kswapd in the background
+to reclaim pages and wake kcompact to compact memory so that THP is
+available in the near future. It's the responsibility of khugepaged
+to then install the THP pages later.
+
+"madvise" will enter direct reclaim like "always" but only for regions
+that are have used madvise(). This is the default behaviour.

"madvise(MADV_HUGEPAGE)" perhaps?

[...]

@@ -277,17 +273,23 @@ static ssize_t double_flag_store(struct kobject *kobj,
  static ssize_t enabled_show(struct kobject *kobj,
  			    struct kobj_attribute *attr, char *buf)
  {
-	return double_flag_show(kobj, attr, buf,
-				TRANSPARENT_HUGEPAGE_FLAG,
-				TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
+	if (test_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags)) {
+		VM_BUG_ON(test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags));
+		return sprintf(buf, "[always] madvise never\n");
+	} else if (test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags))
+		return sprintf(buf, "always [madvise] never\n");
+	else
+		return sprintf(buf, "always madvise [never]\n");

Somewhat ugly wrt consistent usage of { }, due to the VM_BUG_ON(), which I would just drop. Also I wonder if some racy read vs write of the file can trigger the BUG_ON? Or are the kobject accesses synchronized at a higher level?

  }
+
  static ssize_t enabled_store(struct kobject *kobj,
  			     struct kobj_attribute *attr,
  			     const char *buf, size_t count)
  {
  	ssize_t ret;

-	ret = double_flag_store(kobj, attr, buf, count,
+	ret = triple_flag_store(kobj, attr, buf, count,
+				TRANSPARENT_HUGEPAGE_FLAG,
  				TRANSPARENT_HUGEPAGE_FLAG,
  				TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);

So this makes "echo defer > enabled" behave just like "echo always"? For userspace interface that becomes a fixed ABI, I would prefer to be more careful with unintended aliases like this. Maybe pass something like "-1" that triple_flag_store() would check in the "defer" case and return -EINVAL?

@@ -345,16 +347,23 @@ static ssize_t single_flag_store(struct kobject *kobj,
  static ssize_t defrag_show(struct kobject *kobj,
  			   struct kobj_attribute *attr, char *buf)
  {
-	return double_flag_show(kobj, attr, buf,
-				TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
-				TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
+		return sprintf(buf, "[always] defer madvise never\n");
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
+		return sprintf(buf, "always [defer] madvise never\n");
+	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
+		return sprintf(buf, "always defer [madvise] never\n");
+	else
+		return sprintf(buf, "always defer madvise [never]\n");
+
  }
  static ssize_t defrag_store(struct kobject *kobj,
  			    struct kobj_attribute *attr,
  			    const char *buf, size_t count)
  {
-	return double_flag_store(kobj, attr, buf, count,
-				 TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+	return triple_flag_store(kobj, attr, buf, count,
+				 TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
+				 TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
  				 TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
  }
  static struct kobj_attribute defrag_attr =
@@ -784,9 +793,30 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
  	return 0;
  }

-static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
+/*
+ * If THP is set to always then directly reclaim/compact as necessary
+ * If set to defer then do no reclaim and defer to khugepaged
+ * If set to madvise and the VMA is flagged then directly reclaim/compact
+ */
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
+{
+	gfp_t reclaim_flags = 0;
+
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags) &&
+	    (vma->vm_flags & VM_HUGEPAGE))
+		reclaim_flags = __GFP_DIRECT_RECLAIM;
+	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
+		reclaim_flags = __GFP_KSWAPD_RECLAIM;
+	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
+		reclaim_flags = __GFP_DIRECT_RECLAIM;

Hmm, here's a trick question. What if I wanted direct reclaim for madvise() vma's and kswapd/kcompactd for others? Right now there's no such option, right? And expressing that with different values for a single tunable becomes ugly...

(For completeness, somebody could want kswapd/kcompactd defrag only for madvise() but that's arguably not much useful)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>



[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [ECOS]     [Asterisk Internet PBX]     [Linux API]