Re: [PATCH] Prevent OOM casualties by enforcing memcg limits

Chris Down <chris@xxxxxxxxxxxxxx> · Tue, 27 Apr 2021 01:09:20 +0100

Hi Alexander,

Alexander Sosna writes:
Before this commit memory cgroup limits were not enforced during
allocation.  If a process within a cgroup tries to allocates more
memory than allowed, the kernel will not prevent the allocation even if
OVERCOMMIT_NEVER is set.  Than the OOM killer is activated to kill
processes in the corresponding cgroup.

Unresolvable cgroup overages are indifferent to vm.overcommit_memory, since 
exceeding memory.max is not overcommitment, it's just a natural consequence of 
the fact that allocation and reclaim are not atomic processes. Overcommitment, 
on the other hand, is about the bounds of available memory at the global 
resource level.

This behavior is not to be expected
when setting OVERCOMMIT_NEVER (vm.overcommit_memory = 2) and it is a huge
problem for applications assuming that the kernel will deny an allocation
if not enough memory is available, like PostgreSQL.  To prevent this a
check is implemented to not allow a process to allocate more memory than
limited by it's cgroup.  This means a process will not be killed while
accessing pages but will receive errors on memory allocation as
appropriate.  This gives programs a chance to handle memory allocation
failures gracefully instead of being reaped.

We don't guarantee that vm.overcommit_memory 2 means "no OOM killer". It can 
still happen for a bunch of reasons, so I really hope PostgreSQL isn't relying 
on that.

Could you please be more clear about the "huge problem" being solved here? I'm 
not seeing it.

Signed-off-by: Alexander Sosna <alexander@xxxxxxxx>

diff --git a/mm/util.c b/mm/util.c
index a8bf17f18a81..c84b83c532c6 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -853,6 +853,7 @@ EXPORT_SYMBOL_GPL(vm_memory_committed);
 *
 * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
 * Additional code 2002 Jul 20 by Robert Love.
+ * Code to enforce memory cgroup limits added 2021 by Alexander Sosna.
 *
 * cap_sys_admin is 1 if the process has admin privileges, 0 otherwise.
 *
@@ -891,6 +892,34 @@ int __vm_enough_memory(struct mm_struct *mm, long
pages, int cap_sys_admin)
		long reserve = sysctl_user_reserve_kbytes >> (PAGE_SHIFT - 10);

		allowed -= min_t(long, mm->total_vm / 32, reserve);
+
+#ifdef CONFIG_MEMCG
+		/*
+		 * If we are in a memory cgroup we also evaluate if the cgroup
+		 * has enough memory to allocate a new virtual mapping.

This comment confuses me further, I'm afraid. You're talking about virtual 
mappings, but then checking memory.max, which is about allocated pages.

+		 * This is how we can keep processes from exceeding their
+		 * limits and also prevent that the OOM killer must be
+		 * awakened.  This gives programs a chance to handle memory
+		 * allocation failures gracefully and not being reaped.
+		 * In the current version mem_cgroup_get_max() is used which
+		 * allows the processes to exceeded their memory limits if
+		 * enough SWAP is available.  If this is not intended we could
+		 * use READ_ONCE(memcg->memory.max) instead.
+		 *
+		 * This code is only reached if sysctl_overcommit_memory equals
+		 * OVERCOMMIT_NEVER, both other options are handled above.
+		 */
+		{
+			struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
+
+			if (memcg) {
+				long available = mem_cgroup_get_max(memcg)
+						- mem_cgroup_size(memcg);
+
+				allowed = min_t(long, available, allowed);
+			}
+		}
+#endif
	}

	if (percpu_counter_read_positive(&vm_committed_as) < allowed)