[RFC 0/6] the big khugepaged redesign

Vlastimil Babka <vbabka@xxxxxxx> · Mon, 23 Feb 2015 13:58:36 +0100

Recently, there was concern expressed (e.g. [1]) whether the quite aggressive
THP allocation attempts on page faults are a good performance trade-off.

- THP allocations add to page fault latency, as high-order allocations are
  notoriously expensive. Page allocation slowpath now does extra checks for
  GFP_TRANSHUGE && !PF_KTHREAD to avoid the more expensive synchronous
  compaction for user page faults. But even async compaction can be expensive.
- During the first page fault in a 2MB range we cannot predict how much of the
  range will be actually accessed - we can theoretically waste as much as 511
  worth of pages [2]. Or, the pages in the range might be accessed from CPUs
  from different NUMA nodes and while base pages could be all local, THP could
  be remote to all but one CPU. The cost of remote accesses due to this false
  sharing would be higher than any savings on the TLB.
- The interaction with memcg are also problematic [1].

Now I don't have any hard data to show how big these problems are, and I
expect we will discuss this on LSF/MM (and hope somebody has such data [3]).
But it's certain that e.g. SAP recommends to disable THPs [4] for their apps
for performance reasons.

One might think that instead of fully disabling THP's it should be possible to
only disable (or make less aggressive, or limit to MADV_HUGEPAGE regions) THP's
for page faults and leave the collapsing up to khugepaged, which would hide the
latencies and allow better decisions based on how many base pages were faulted
in and from which nodes. However, looking more closely gives the impression
that khugepaged was meant rather as a rarely needed fallback for cases where
the THP page fault fails due to e.g. low memory. There are some tunables under
/sys/kernel/mm/transparent_hugepage/ but it doesn't seem sufficient for moving
the bulk of the THP work to khugepaged as it is.

- setting "defrag" to "madvise" or "never", while leaving khugepaged/defrag=1
  will result in very lightweight THP allocation attempts during page faults.
  This is nice and solves the latency problem, but not the other problems
  described above. It doesn't seem possible to disable page fault THP's
  completely without also disabling khugepaged.
- even if it was possible, the default settings for khugepaged are to scan up
  to 8 PMD's and collapse up to 1 THP per 10 seconds. That's probably too slow
  for some workloads and machines, but if one was to set this to be more
  aggressive, it would become quite inefficient. Khugepaged has a single global
  list of mm's to scan, which may include lot of tasks where scanning won't yield
  anything. It should rather focus on tasks that are actually running (and thus
  could benefit) and where collapses were recently successful. The activity
  should also be ideally accounted to the task that benefits from it.
- scanning on NUMA systems will proceed even when actual THP allocations fail,
  e.g. during memory pressure. In such case it should be better to save the
  scanning time until memory is available.
- there were some limitations on which PMD's khugepaged can collapse. Thanks to
  Ebru's recent patches, this should be fixed soon. With the
  khugepaged/max_ptes_none tunable, one can limit the potential memory wasted.
  But limiting NUMA false sharing is performed just in zone reclaim mode and
  could be made stricter.

This RFC patchset doesn't try to solve everything mentioned above - the full
motivation is included for the bigger picture discussion, including LSF/MM.
The main part of this patchset is the move of collapse scanning from
khugepaged to the task_work context. This has been already discussed as a good
idea and a RFC has been posted by Alex Thorlton last October [5]. In that
prototype, scanning has been driven from __khugepaged_enter(), which is called
on events such as vma_merge, fork, and THP page fault attempts, e.g. events
that are not exactly periodical. The main difference in my patchset is it being
modeled after the automatic NUMA balancing scanning, i.e. using scheduler's
task_tick_fair(). The second difference is that khugepaged is not disabled
entirely, but repurposed for the costly hugepage allocations. There is a
nodemask for indicating which nodes should have hugepages easily available.
The idea is that hugepage allocation attempts from the process context
(either page fault or the task_work collapsing) would not attempt
reclaim/compaction, and on failure will clear the nodemask bit and wake up
khugepaged to do the hard work and flip the bit back on. If if appears that
ther are no hugepages available, attempts to page fault THP or scan are
suspended.

I have done only light testing so far to see that it works as intended, but
not to prove it's "better" than current state. I wanted to post the RFC before
LSF/MM.

There are known limitations and TODO/FIXME's in the code, for example:
- the scanning period doesn't yet adjust itself based on recent collapse
  success/failure. The idea is that it will double itself if the last full
  mm scan yielded no collapse.
- some user-visible counters for the task scanning activity should be added
- the change of THP allocation attempts pressure can have hard to predict
  outcomes on fragmentation and the periods of deferred compaction.
- Documentation should be updated

More stuff is not decided yet:
- moving to task context from khugepaged results in the hugepage allocations
  for collapsing not use sync compaction anymore. But should we also change
  the "defrag" default from "always" to e.g. "madvise", which results in
  ~__GFP_WAIT allocations to further reduce latencies perceived from process
  context?
- Would make sense to have one khugepaged instance per memory node, bound to
  the corresponding CPU's? It would then touch local memory e.g. during
  compaction migrations, and the khugepaged sleeps and wakeups would be more
  fine-grained.
- should we keep using the tunables under /sys/.../khugepaged for the activity
  that is mostly no longer performed by khugepaged, and when e.g. NUMA
  balancing uses sysctl?

The stuff not touched by this patchset (nor decided):
- should we allow the user to disable THP page faults completely (or restrict
  them to MADV_HUGEPAGE vma's?), or just assume that max_ptes_none < 511 means
  it should be disabled, because we cannot know how much of the 2MB the process
  would fault in?
- do we want to further limit collapses when base pages come from different
  NUMA nodes? Another tunable (saying that minimum X pte's should be from
  single node), or just hard-require that all are from the same node?

Patchset was lightly tested on v3.19 + 2 cherry-picked patches 077fcf116c8c and
be97a41b291, and rebased to 4.0-rc1 before sending.

[1] http://marc.info/?l=linux-mm&m=142056088232484&w=2
[2] http://www.spinics.net/lists/kernel/msg1928252.html
[3] http://www.spinics.net/lists/linux-mm/msg84023.html
[4] http://scn.sap.com/people/markmumy/blog/2014/05/22/sap-iq-and-linux-hugepagestransparent-hugepages
[5] https://lkml.org/lkml/2014/10/22/931

Vlastimil Babka (6):
  mm, thp: stop preallocating hugepages in khugepaged
  mm, thp: make khugepaged check for THP allocability before scanning
  mm, thp: try fault allocations only if we expect them to succeed
  mm, thp: move collapsing from khugepaged to task_work context
  mm, thp: wakeup khugepaged when THP allocation fails
  mm, thp: remove no longer needed khugepaged code

 include/linux/khugepaged.h |  19 +-
 include/linux/mm_types.h   |   4 +
 include/linux/sched.h      |   5 +
 kernel/fork.c              |   1 -
 kernel/sched/core.c        |  12 +
 kernel/sched/fair.c        | 124 ++++++++-
 mm/huge_memory.c           | 628 ++++++++++++++-------------------------------
 7 files changed, 339 insertions(+), 454 deletions(-)

-- 
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>