+ mm-prepare-for-swap-over-high-accounting-and-penalty-calculation.patch added to -mm tree

Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> · Wed, 27 May 2020 14:33:01 -0700

The patch titled
     Subject: mm/memcg: prepare for swap over-high accounting and penalty calculation
has been added to the -mm tree.  Its filename is
     mm-prepare-for-swap-over-high-accounting-and-penalty-calculation.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-prepare-for-swap-over-high-accounting-and-penalty-calculation.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-prepare-for-swap-over-high-accounting-and-penalty-calculation.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days.

------------------------------------------------------
From: Jakub Kicinski <kuba@xxxxxxxxxx>
Subject: mm/memcg: prepare for swap over-high accounting and penalty calculation

Patch series "memcg: Slow down swap allocation as the available space gets
depleted", v6.

Tejun describes the problem as follows:

When swap runs out, there's an abrupt change in system behavior - the
anonymous memory suddenly becomes unmanageable which readily breaks any
sort of memory isolation and can bring down the whole system.  To avoid
that, oomd [1] monitors free swap space and triggers kills when it drops
below a specific threshold (e.g. 15%).

While this works, it's far from ideal:
 - Depending on IO performance and total swap size, a given
   headroom might be too little or too much.
 - oomd has to monitor swap depletion in addition to the usual
   pressure metrics and it currently doesn't consider memory.swap.max.

Solve this by adapting parts of the approach that memory.high uses: slow
down allocation as the resource gets depleted, turning the depletion
behavior from an abrupt cliff into a gradual degradation that is observable
through the memory pressure metric.

[1] https://github.com/facebookincubator/oomd


This patch (of 4):

Split up the memory overage calculation logic so that it can be reused to
apply a similar penalty to swap.  The logic which accesses the
memory-specific fields (usage and high values) has to be taken out of
calculate_high_delay().
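
For illustration only, the split described above can be sketched in
userspace C.  This is a minimal sketch, not the kernel code: uint64_t
stands in for u64, plain division stands in for div64_u64(), and
struct level, find_max_overage() and overage_example() are hypothetical
stand-ins for the cgroup ancestor walk.  Only calculate_overage() and the
20-bit precision shift mirror the patch below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors MEMCG_DELAY_PRECISION_SHIFT from the patch. */
#define DELAY_PRECISION_SHIFT 20

/*
 * Fraction by which usage exceeds high, as a fixed-point value with
 * DELAY_PRECISION_SHIFT fractional bits.  Returns 0 when under the limit.
 */
static uint64_t calculate_overage(unsigned long usage, unsigned long high)
{
	uint64_t overage;

	if (usage <= high)
		return 0;

	/* Avoid dividing by 0: treat a zero threshold as 1 page. */
	if (high < 1)
		high = 1;

	overage = usage - high;
	overage <<= DELAY_PRECISION_SHIFT;
	return overage / high;
}

/* Hypothetical: one entry per cgroup level, replacing the parent walk. */
struct level {
	unsigned long usage;
	unsigned long high;
};

/* Hypothetical: return the worst overage over all levels. */
static uint64_t find_max_overage(const struct level *levels, size_t n)
{
	uint64_t max_overage = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		uint64_t overage = calculate_overage(levels[i].usage,
						     levels[i].high);
		if (overage > max_overage)
			max_overage = overage;
	}
	return max_overage;
}

/* Hypothetical usage: the ancestor at 3x its limit dominates. */
void overage_example(void)
{
	const struct level levels[] = {
		{ .usage = 150, .high = 100 },	/* 50% over  */
		{ .usage = 300, .high = 100 },	/* 200% over */
	};

	assert(find_max_overage(levels, 2) ==
	       (uint64_t)2 << DELAY_PRECISION_SHIFT);
}
```

Because calculate_overage() only takes a usage and a limit, the same
helper could be fed swap counters instead of memory counters, which is the
reuse this patch prepares for.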

Link: http://lkml.kernel.org/r/20200527195846.102707-1-kuba@xxxxxxxxxx
Link: http://lkml.kernel.org/r/20200527195846.102707-2-kuba@xxxxxxxxxx
Signed-off-by: Jakub Kicinski <kuba@xxxxxxxxxx>
Reviewed-by: Shakeel Butt <shakeelb@xxxxxxxxxx>
Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: Chris Down <chris@xxxxxxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxxxx>
Cc: Tejun Heo <tj@xxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/memcontrol.c |   62 +++++++++++++++++++++++++---------------------
 1 file changed, 35 insertions(+), 27 deletions(-)

--- a/mm/memcontrol.c~mm-prepare-for-swap-over-high-accounting-and-penalty-calculation
+++ a/mm/memcontrol.c
@@ -2321,41 +2321,48 @@ static void high_work_func(struct work_s
  #define MEMCG_DELAY_PRECISION_SHIFT 20
  #define MEMCG_DELAY_SCALING_SHIFT 14
 
-/*
- * Get the number of jiffies that we should penalise a mischievous cgroup which
- * is exceeding its memory.high by checking both it and its ancestors.
- */
-static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
-					  unsigned int nr_pages)
+static u64 calculate_overage(unsigned long usage, unsigned long high)
 {
-	unsigned long penalty_jiffies;
-	u64 max_overage = 0;
+	u64 overage;
 
-	do {
-		unsigned long usage, high;
-		u64 overage;
+	if (usage <= high)
+		return 0;
 
-		usage = page_counter_read(&memcg->memory);
-		high = READ_ONCE(memcg->high);
+	/*
+	 * Prevent division by 0 in overage calculation by acting as if
+	 * it was a threshold of 1 page
+	 */
+	high = max(high, 1UL);
 
-		if (usage <= high)
-			continue;
+	overage = usage - high;
+	overage <<= MEMCG_DELAY_PRECISION_SHIFT;
+	return div64_u64(overage, high);
+}
 
-		/*
-		 * Prevent division by 0 in overage calculation by acting as if
-		 * it was a threshold of 1 page
-		 */
-		high = max(high, 1UL);
-
-		overage = usage - high;
-		overage <<= MEMCG_DELAY_PRECISION_SHIFT;
-		overage = div64_u64(overage, high);
+static u64 mem_find_max_overage(struct mem_cgroup *memcg)
+{
+	u64 overage, max_overage = 0;
 
-		if (overage > max_overage)
-			max_overage = overage;
+	do {
+		overage = calculate_overage(page_counter_read(&memcg->memory),
+					    READ_ONCE(memcg->high));
+		max_overage = max(overage, max_overage);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
 
+	return max_overage;
+}
+
+/*
+ * Get the number of jiffies that we should penalise a mischievous cgroup which
+ * is exceeding its memory.high by checking both it and its ancestors.
+ */
+static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
+					  unsigned int nr_pages,
+					  u64 max_overage)
+{
+	unsigned long penalty_jiffies;
+
 	if (!max_overage)
 		return 0;
 
@@ -2411,7 +2418,8 @@ void mem_cgroup_handle_over_high(void)
 	 * memory.high is breached and reclaim is unable to keep up. Throttle
 	 * allocators proactively to slow down excessive growth.
 	 */
-	penalty_jiffies = calculate_high_delay(memcg, nr_pages);
+	penalty_jiffies = calculate_high_delay(memcg, nr_pages,
+					       mem_find_max_overage(memcg));
 
 	/*
 	 * Don't sleep if the amount of jiffies this memcg owes us is so low
_

Patches currently in -mm which might be from kuba@xxxxxxxxxx are

mm-prepare-for-swap-over-high-accounting-and-penalty-calculation.patch
mm-move-penalty-delay-clamping-out-of-calculate_high_delay.patch
mm-move-cgroup-high-memory-limit-setting-into-struct-page_counter.patch
mm-automatically-penalize-tasks-with-high-swap-use.patch