[to-be-updated] mm-workingset-dont-drop-refault-information-prematurely.patch removed from -mm tree

akpm@xxxxxxxxxxxxxxxxxxxx · Tue, 17 Jul 2018 13:27:36 -0700

The patch titled
     Subject: mm: workingset: don't drop refault information prematurely
has been removed from the -mm tree.  Its filename was
     mm-workingset-dont-drop-refault-information-prematurely.patch

This patch was dropped because an updated version will be merged

------------------------------------------------------
From: Johannes Weiner <jweiner@xxxxxx>
Subject: mm: workingset: don't drop refault information prematurely

Patch series "psi: pressure stall information for CPU, memory, and IO", v2.

PSI aggregates and reports the overall wallclock time in which the tasks
in a system (or cgroup) wait for contended hardware resources.

This helps users understand the resource pressure their workloads are
under, which allows them to rootcause and fix throughput and latency
problems caused by overcommitting, underprovisioning, suboptimal job
placement in a grid, as well as anticipate major disruptions like OOM.

This version 2 of the series incorporates a ton of feedback from PeterZ
and SurenB; more details at the end of this email.

		Real-world applications

We're using the data collected by psi (and its previous incarnation,
memdelay) quite extensively at Facebook, with several success stories.

One usecase is avoiding OOM hangs/livelocks.  The reason these happen is
because the OOM killer is triggered by reclaim not being able to free
pages, but with fast flash devices there is *always* some clean and
uptodate cache to reclaim; the OOM killer never kicks in, even as tasks
spend 90% of the time thrashing the cache pages of their own executables. 
There is no situation where this ever makes sense in practice.  We wrote a
<100 line POC python script to monitor memory pressure and kill stuff way
before such pathological thrashing leads to full system losses that
require forcible hard resets.

We've since extended and deployed this code into other places to guarantee
latency and throughput SLAs, since they're usually violated way before the
kernel OOM killer would ever kick in.

The idea is to eventually incorporate this back into the kernel, so that
Linux can avoid OOM livelocks (which TECHNICALLY aren't memory deadlocks,
but for the user indistinguishable) out of the box.

We also use psi memory pressure for loadshedding.  Our batch job
infrastructure used to use heuristics based on various VM stats to
anticipate OOM situations, with lackluster success.  We switched it to psi
and managed to anticipate and avoid OOM kills and hangs fairly reliably. 
The reduction of OOM outages in the worker pool raised the pool's
aggregate productivity, and we were able to switch that service to smaller
machines.

Lastly, we use cgroups to isolate a machine's main workload from
maintenance crap like package upgrades, logging, configuration, as well as
to prevent multiple workloads on a machine from stepping on each others'
toes.  We were not able to configure this properly without the pressure
metrics; we would see latency or bandwidth drops, but it would often be
hard to impossible to rootcause it post-mortem.

We now log and graph pressure for the containers in our fleet and can
trivially link latency spikes and throughput drops to shortages of
specific resources after the fact, and fix the job config/scheduling.

I've also recieved feedback and feature requests from Android for the
purpose of low-latency OOM killing.  The on-demand stats aggregation in
the last patch of this series is for this purpose, to allow Android to
react to pressure before the system starts visibly hanging.

		How do you use this feature?

A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3
files: cpu, memory, and io.  If using cgroup2, cgroups will also have
cpu.pressure, memory.pressure and io.pressure files, which simply
aggregate task stalls at the cgroup level instead of system-wide.

The cpu file contains one line:

	some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722

The averages give the percentage of walltime in which one or more tasks
are delayed on the runqueue while another task has the CPU.  They're
recent averages over 10s, 1m, 5m windows, so you can tell short term
trends from long term ones, similarly to the load average.

The total= value gives the absolute stall time in microseconds.  This
allows detecting latency spikes that might be too short to sway the
running averages.  It also allows custom time averaging in case the
10s/1m/5m windows aren't adequate for the usecase (or are too coarse with
future hardware).

What to make of this "some" metric?  If CPU utilization is at 100% and CPU
pressure is 0, it means the system is perfectly utilized, with one
runnable thread per CPU and nobody waiting.  At two or more runnable tasks
per CPU, the system is 100% overcommitted and the pressure average will
indicate as much.  From a utilization perspective this is a great state of
course: no CPU cycles are being wasted, even when 50% of the threads were
to go idle (as most workloads do vary).  From the perspective of the
individual job it's not great, however, and they would do better with more
resources.  Depending on what your priority and options are, raised "some"
numbers may or may not require action.

The memory file contains two lines:

some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258

The some line is the same as for cpu, the time in which at least one task
is stalled on the resource.  In the case of memory, this includes waiting
on swap-in, page cache refaults and page reclaim.

The full line, however, indicates time in which *nobody* is using the CPU
productively due to pressure: all non-idle tasks are waiting for memory in
one form or another.  Significant time spent in there is a good trigger
for killing things, moving jobs to other machines, or dropping incoming
requests, since neither the jobs nor the machine overall are making too
much headway.

The io file is similar to memory.  Because the block layer doesn't have a
concept of hardware contention right now (how much longer is my IO request
taking due to other tasks?), it reports CPU potential lost on all IO
delays, not just the potential lost due to competition.

		FAQ

Q: How is PSI's CPU component different from the load average?

A: There are several quirks in the load average that make it hard to
   impossible to tell how overcommitted the CPU really is.

   1. The load average is reported as a raw number of active tasks.
      You need to know how many CPUs there are in the system, how many
      CPUs the workload is allowed to use, then think about what the
      proportion between load and the number of CPUs means for the
      tasks trying to run.

      PSI reports the percentage of wallclock time in which tasks are
      waiting for a CPU to run on. It doesn't matter how many CPUs are
      present or usable. The number always tells the quality of life
      of tasks in the system or in a particular cgroup.

   2. The shortest averaging window is 1m, which is extremely coarse,
      and it's sampled in 5s intervals. A *lot* can happen on a CPU in
      5 seconds. This *may* be able to identify persistent long-term
      trends and very clear and obvious overloads, but it's unusable
      for latency spikes and more subtle overutilization.

      PSI's shortest window is 10s. It also exports the cumulative
      stall times (in microseconds) of synchronously recorded events.

   3. On Linux, the load average for historical reasons includes all
      TASK_UNINTERRUPTIBLE tasks. This gives a broader sense of how
      busy the system is, but on the flipside it doesn't distinguish
      whether tasks are likely to contend over the CPU or IO - which
      obviously requires very different interventions from a sys admin
      or a job scheduler.

      PSI reports independent metrics for CPU and IO. You can tell
      which resource is making the tasks wait, but in conjunction
      still see how overloaded the system is overall.


This patch (of 10):

If we just keep enough refault information to match the CURRENT page cache
during reclaim time, we could lose a lot of events when there is only a
temporary spike in non-cache memory consumption that pushes out all the
cache.  Once cache comes back, we won't see those refaults.  They might
not be actionable for LRU aging, but we want to know about them for
measuring memory pressure.

Link: http://lkml.kernel.org/r/20180712172942.10094-2-hannes@xxxxxxxxxxx
Signed-off-by: Johannes Weiner <jweiner@xxxxxx>
Tested-by: Daniel Drake <drake@xxxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Tejun Heo <tj@xxxxxxxxxx>
Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx>
Cc: Vinayak Menon <vinmenon@xxxxxxxxxxxxxx>
Cc: Christopher Lameter <cl@xxxxxxxxx>
Cc: Mike Galbraith <efault@xxxxxx>
Cc: Shakeel Butt <shakeelb@xxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/workingset.c |   18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff -puN mm/workingset.c~mm-workingset-dont-drop-refault-information-prematurely mm/workingset.c

--- a/mm/workingset.c~mm-workingset-dont-drop-refault-information-prematurely
+++ a/mm/workingset.c
@@ -364,7 +364,7 @@ static unsigned long count_shadow_nodes(
 {
 	unsigned long max_nodes;
 	unsigned long nodes;
-	unsigned long cache;
+	unsigned long pages;
 
 	nodes = list_lru_shrink_count(&shadow_nodes, sc);
 
@@ -390,14 +390,14 @@ static unsigned long count_shadow_nodes(
 	 *
 	 * PAGE_SIZE / radix_tree_nodes / node_entries * 8 / PAGE_SIZE
 	 */
-	if (sc->memcg) {
-		cache = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
-						     LRU_ALL_FILE);
-	} else {
-		cache = node_page_state(NODE_DATA(sc->nid), NR_ACTIVE_FILE) +
-			node_page_state(NODE_DATA(sc->nid), NR_INACTIVE_FILE);
-	}
-	max_nodes = cache >> (RADIX_TREE_MAP_SHIFT - 3);
+#ifdef CONFIG_MEMCG
+	if (sc->memcg)
+		pages = page_counter_read(&sc->memcg->memory);
+	else
+#endif
+		pages = node_present_pages(sc->nid);
+
+	max_nodes = pages >> (RADIX_TREE_MAP_SHIFT - 3);
 
 	if (!nodes)
 		return SHRINK_EMPTY;
_

Patches currently in -mm which might be from jweiner@xxxxxx are


--
To unsubscribe from this list: send the line "unsubscribe mm-commits" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html