Re: [RFC PATCH] mm, memcg: introduce memory.high.throttle

Waiman Long <llong@xxxxxxxxxx> · Thu, 30 Jan 2025 12:07:31 -0500

On 1/30/25 11:39 AM, Johannes Weiner wrote:
On Thu, Jan 30, 2025 at 09:52:29AM -0500, Waiman Long wrote:
On 1/29/25 3:10 PM, Yosry Ahmed wrote:
On Wed, Jan 29, 2025 at 02:12:04PM -0500, Waiman Long wrote:
Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing
reclaim over memory.high"), the amount of allocator throttling has
increased substantially. As a result, a misbehaving application that
consumes an increasing amount of memory may be hard to get OOM-killed
if memory.high is set. Instead, the application may just crawl along
for a very long time while holding close to the memory.high limit of
its memory cgroup. This is especially true for applications that do
a lot of memcg charging and uncharging operations.

This behavior makes the upstream Kubernetes community hesitate to
use memory.high. Instead, they use only memory.max for memory control
similar to what is being done for cgroup v1 [1].

To allow better control of the amount of throttling, and hence of how
quickly a misbehaving task can be OOM killed, a new single-value
memory.high.throttle control file is now added. The allowable range
is 0-32.  By default, it has a value of 0 which means maximum throttling
like before. Any non-zero positive value represents the corresponding
power of 2 reduction of throttling and makes OOM kills more likely.

System administrators can now use this parameter to determine how
readily OOM kills should happen for applications that tend to consume
a lot of memory, without the need to run a special userspace memory
management tool to monitor memory consumption when memory.high is set.

Below are the test results of a simple program showing how different
values of memory.high.throttle can affect its run time (in secs) until
it gets OOM killed. This test program allocates pages from the kernel
continuously. There are some run-to-run variations and the results
are just one possible set of samples.

    # systemd-run -p MemoryHigh=10M -p MemoryMax=20M -p MemorySwapMax=10M \
	--wait -t timeout 300 /tmp/mmap-oom

    memory.high.throttle	service runtime
    --------------------	---------------
              0		    120.521
              1		    103.376
              2		     85.881
              3		     69.698
              4		     42.668
              5		     45.782
              6		     22.179
              7		      9.909
              8		      5.347
              9		      3.100
             10		      1.757
             11		      1.084
             12		      0.919
             13		      0.650
             14		      0.650
             15		      0.655

[1] https://docs.google.com/document/d/1mY0MTT34P-Eyv5G1t_Pqs4OWyIH-cg9caRKWmqYlSbI/edit?tab=t.0

Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
---
   Documentation/admin-guide/cgroup-v2.rst | 16 ++++++++--
   include/linux/memcontrol.h              |  2 ++
   mm/memcontrol.c                         | 41 +++++++++++++++++++++++++
   3 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index cb1b4e759b7e..df9410ad8b3b 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1291,8 +1291,20 @@ PAGE_SIZE multiple when read back.
   	Going over the high limit never invokes the OOM killer and
   	under extreme conditions the limit may be breached. The high
   	limit should be used in scenarios where an external process
-	monitors the limited cgroup to alleviate heavy reclaim
-	pressure.
+	monitors the limited cgroup to alleviate heavy reclaim pressure
+	unless a high enough value is set in "memory.high.throttle".
+
+  memory.high.throttle
+	A read-write single value file which exists on non-root
+	cgroups.  The default is 0.
+
+	Memory usage throttle control.	This value controls the amount
+	of throttling that will be applied when memory consumption
+	exceeds the "memory.high" limit.  The larger the value is,
+	the smaller the amount of throttling will be and the easier an
+	offending application may get OOM killed.

memory.high is supposed to never invoke the OOM killer (see above). It's
unclear to me if you are referring to OOM kills from the kernel or
userspace in the commit message. If the latter, I think it shouldn't be
in kernel docs.

I am sorry for not being clear. What I meant is that if an application
is consuming more memory than what can be recovered by memory reclaim,
it will reach memory.max faster, if set, and get OOM killed. Will
clarify that in the next version.

You're not really supposed to use max and high in conjunction. One is
for kernel OOM killing, the other for userspace OOM killing. That's
what the documentation that you edited is trying to explain.

What's the usecase you have in mind?

It is new to me that high and max are not supposed to be used
together. One problem with v1 is that once the limit is reached and
memory reclaim is not able to recover enough memory in time, the
task will be OOM killed. I always thought that by setting high to a bit
below max, say 90%, early memory reclaim will reduce the chance of OOM
kills. There are certainly others who think like that.

So the use case here is to reduce the chance of OOM kills without
letting really misbehaving tasks hold up useful memory for too long.

Cheers,
Longman