The patch titled
     Subject: memcg: introduce per-memcg reclaim interface
has been removed from the -mm tree.  Its filename was
     memcg-introduce-per-memcg-reclaim-interface.patch

This patch was dropped because it was nacked

------------------------------------------------------
From: Shakeel Butt <shakeelb@xxxxxxxxxx>
Subject: memcg: introduce per-memcg reclaim interface

Introduce a memcg interface to trigger memory reclaim on a memory cgroup.

Use cases:
----------

1) Per-memcg uswapd:

Applications usually consist of a combination of latency-sensitive and
latency-tolerant tasks, for example tasks serving user requests vs tasks
doing data backup for a database application.  At the moment the kernel
does not differentiate between such tasks when the application hits the
memcg limits, so a latency-sensitive user-facing task can get stuck in
high reclaim and be throttled by the kernel.

Similarly there are cases of single-process applications having two sets
of thread pools where threads from one pool have high scheduling priority
and low latency requirements.  One concrete example from our production
is the VMM, which has a high-priority low-latency thread pool for the
VCPUs and a separate thread pool for stats reporting, I/O emulation,
health checks and other managerial operations.  The kernel memory reclaim
does not differentiate between a VCPU thread and a non-latency-sensitive
thread, so a VCPU thread can get stuck in high reclaim.

One way to resolve this issue is to preemptively trigger the memory
reclaim from a latency-tolerant task (uswapd) when the application is
near the limits.  Detecting the 'near the limits' situation is an
orthogonal problem.

2) Proactive reclaim:

This is similar to the previous use-case; the difference is that instead
of waiting for the application to get near its limit before triggering
memory reclaim, the memcg is continuously pressured to reclaim a small
amount of memory.  This gives a more accurate and up-to-date working-set
estimation, as the LRUs are continuously sorted, and can potentially
provide more deterministic memory overcommit behavior.  The memory
overcommit controller can respond more proactively to the changing
behavior of the running applications instead of being reactive.

Benefit of user space solution:
-------------------------------

1) More flexibility in who is charged for the cpu cost of the memory
reclaim.  For proactive reclaim it makes more sense to centralize the
overhead, while for uswapd it makes more sense for the application to pay
for the cpu cost of its own memory reclaim.

2) More flexibility in dedicating resources (like cpu).  The memory
overcommit controller can balance the cost between the cpu usage and the
memory reclaimed.

3) Provides a way for applications to keep their LRUs sorted, so that
under memory pressure better reclaim candidates are selected.  This also
gives a more accurate and up-to-date notion of the working set of an
application.

Questions:
----------

1) Why is memory.high not enough?

memory.high can be used to trigger reclaim in a memcg and can potentially
be used for the proactive reclaim as well as the uswapd use cases.
However, there is a big downside to using memory.high: it can introduce
high reclaim stalls in the target application, as allocations from the
processes or threads of the application can hit the temporary memory.high
limit.

Another issue with memory.high is that it is not delegatable.  To
actually use this interface for uswapd, the application has to introduce
another layer of cgroup on whose memory.high it has write access.

2) Why is uswapd safe from self-induced reclaim?

This is very similar to the scenario of oomd under global memory
pressure.  We can use similar mechanisms to protect uswapd from
self-induced reclaim, i.e. memory.min and mlock.

Interface options:
------------------

Introduce a very simple memcg interface, 'echo 10M > memory.reclaim', to
trigger reclaim in the target memory cgroup.

In the future we might want to reclaim specific types of memory from a
memcg, so this interface can be extended to allow that, e.g.

	$ echo 10M [all|anon|file|kmem] > memory.reclaim

However, that should wait until there are concrete use-cases for such
functionality.  Keep things simple for now.
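[Editor's illustration, not part of the patch below: a uswapd or
proactive-reclaim daemon could drive this interface with nothing more
than periodic writes to the target cgroup's memory.reclaim file.  The
following minimal sketch assumes a cgroup at the hypothetical path
/sys/fs/cgroup/workload; the path, the 10M step and the pacing interval
are made-up example values.]

/*
 * Illustrative sketch only, not part of this patch: a minimal
 * proactive-reclaim loop that periodically asks the kernel to reclaim
 * a small amount of memory from one cgroup by writing to its
 * memory.reclaim file.  The cgroup path, step size and interval below
 * are arbitrary example values.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical cgroup path; adjust for the real hierarchy. */
	const char *path = "/sys/fs/cgroup/workload/memory.reclaim";
	const char request[] = "10M";	/* bytes to reclaim per step */

	for (;;) {
		int fd = open(path, O_WRONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		/* The kernel may reclaim more or less than requested. */
		if (write(fd, request, strlen(request)) < 0)
			perror("write");
		close(fd);
		sleep(10);	/* arbitrary pacing interval */
	}
}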
Link: https://lkml.kernel.org/r/20200909215752.1725525-1-shakeelb@xxxxxxxxxx
Signed-off-by: Shakeel Butt <shakeelb@xxxxxxxxxx>
Reviewed-by: SeongJae Park <sjpark@xxxxxxxxx>
Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: Roman Gushchin <guro@xxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxxxx>
Cc: Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx>
Cc: Greg Thelen <gthelen@xxxxxxxxxx>
Cc: David Rientjes <rientjes@xxxxxxxxxx>
Cc: "Michal Koutný" <mkoutny@xxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 Documentation/admin-guide/cgroup-v2.rst |    9 +++++
 mm/memcontrol.c                         |   37 ++++++++++++++++++++++
 2 files changed, 46 insertions(+)

--- a/Documentation/admin-guide/cgroup-v2.rst~memcg-introduce-per-memcg-reclaim-interface
+++ a/Documentation/admin-guide/cgroup-v2.rst
@@ -1181,6 +1181,15 @@ PAGE_SIZE multiple when read back.
 	high limit is used and monitored properly, this limit's
 	utility is limited to providing the final safety net.
 
+  memory.reclaim
+	A write-only file which exists on non-root cgroups.
+
+	This is a simple interface to trigger memory reclaim in the
+	target cgroup.  Write the number of bytes to reclaim to this
+	file and the kernel will try to reclaim that much memory.
+	Please note that the kernel can over or under reclaim from
+	the target cgroup.
+
   memory.oom.group
 	A read-write single value file which exists on non-root
 	cgroups.  The default value is "0".
--- a/mm/memcontrol.c~memcg-introduce-per-memcg-reclaim-interface
+++ a/mm/memcontrol.c
@@ -6403,6 +6403,38 @@ static ssize_t memory_oom_group_write(st
 	return nbytes;
 }
 
+static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
+			      size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
+	unsigned long nr_to_reclaim, nr_reclaimed = 0;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "", &nr_to_reclaim);
+	if (err)
+		return err;
+
+	while (nr_reclaimed < nr_to_reclaim) {
+		unsigned long reclaimed;
+
+		if (signal_pending(current))
+			break;
+
+		reclaimed = try_to_free_mem_cgroup_pages(memcg,
+					nr_to_reclaim - nr_reclaimed,
+					GFP_KERNEL, true);
+
+		if (!reclaimed && !nr_retries--)
+			break;
+
+		nr_reclaimed += reclaimed;
+	}
+
+	return nbytes;
+}
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -6455,6 +6487,11 @@ static struct cftype memory_files[] = {
 		.seq_show = memory_oom_group_show,
 		.write = memory_oom_group_write,
 	},
+	{
+		.name = "reclaim",
+		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+		.write = memory_reclaim,
+	},
 	{ }	/* terminate */
 };
_

Patches currently in -mm which might be from shakeelb@xxxxxxxxxx are