Re: [PATCH] memcg: Add a new sysctl parameter for automatically setting memory.high

Waiman Long <longman@xxxxxxxxxx> · Mon, 24 Jun 2024 12:33:27 -0400

On 6/24/24 11:21, Roman Gushchin wrote:
> On Sun, Jun 23, 2024 at 04:52:00PM -0400, Waiman Long wrote:
>> Correct some email addresses.
>>
>> On 6/23/24 16:45, Waiman Long wrote:
>>> With memory cgroup v1, there is only a single "memory.limit_in_bytes"
>>> to be set to specify the maximum amount of memory that is allowed to
>>> be used. So a lot of tools and applications that use memory cgroups
>>> allow users to specify only a single memory limit. When they migrate
>>> to cgroup v2, they use the given memory limit to set memory.max and
>>> disregard memory.high for the time being.
>>>
>>> Without properly setting memory.high, these user space applications
>>> cannot make use of the memory cgroup v2 ability to further reduce the
>>> chance of OOM kills by throttling and early memory reclaim.
>>>
>>> This patch adds a new sysctl parameter "vm/memory_high_autoset_ratio"
>>> to enable setting "memory.high" automatically whenever "memory.max" is
>>> set, as long as "memory.high" hasn't been explicitly set before. This
>>> will allow a system administrator or a middleware layer to greatly
>>> reduce the chance of memory cgroup OOM kills without worrying about
>>> how to properly set memory.high.
>>>
>>> The new sysctl parameter will allow a range of 0-100. The default value
>>> of 0 will disable memory.high auto setting. For any non-zero value "n",
>>> the actual ratio used will be "n/(n+1)". Since the smallest non-zero
>>> value, n = 1, gives 1/2, a user cannot set a fraction less than 1/2.
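
To make the arithmetic of the proposed ratio concrete, here is a minimal sketch in Python. It only illustrates the "n/(n+1)" rule quoted above; the function name autoset_high is hypothetical, and the integer rounding is an assumption, not necessarily what the patch does in the kernel.

```python
from typing import Optional


def autoset_high(memory_max: int, n: int) -> Optional[int]:
    """Illustrative only: derive a memory.high value from memory.max.

    n plays the role of vm/memory_high_autoset_ratio (0-100).
    Rounding down to a whole byte is an assumption of this sketch.
    """
    if not 0 <= n <= 100:
        raise ValueError("ratio must be in the range 0-100")
    if n == 0:
        # 0 disables auto setting of memory.high.
        return None
    # The effective fraction is n/(n+1): 1/2 for n=1, 100/101 for n=100.
    return memory_max * n // (n + 1)
```

With memory.max at 1 GiB, n = 1 would give 512 MiB, and larger values of n move memory.high closer to memory.max.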
> Hi Waiman,
>
> I'm not sure that setting memory.high is always a good idea (it comes
> with a certain cost, e.g. it can increase latency), but even if it is,
> why can't systemd or similar userspace tools do this?

We actually have an OOM problem with OpenShift, which is based on
Kubernetes. AFAIK, the setting of memory.high is still in alpha for
Kubernetes. So a memory cgroup is set up just by setting memory.max at
the moment.
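
For comparison, the userspace alternative could look roughly like the sketch below. This is not taken from systemd or Kubernetes; the helper name set_high_from_max is hypothetical, and treating a memory.high value of "max" as "not explicitly set" is an approximation made only for this sketch. The file names memory.max and memory.high are the standard cgroup v2 interface files.

```python
from pathlib import Path
from typing import Optional


def set_high_from_max(cgroup_dir: str, n: int) -> Optional[int]:
    """Illustrative only: set memory.high to n/(n+1) of memory.max.

    Returns the value written, or None if nothing was changed.
    """
    cg = Path(cgroup_dir)
    limit = (cg / "memory.max").read_text().strip()
    current_high = (cg / "memory.high").read_text().strip()

    # Skip when disabled, when there is no hard limit, or when memory.high
    # already holds a value (assumed here to mean it was set explicitly).
    if n <= 0 or limit == "max" or current_high != "max":
        return None

    high = int(limit) * n // (n + 1)
    (cg / "memory.high").write_text(f"{high}\n")
    return high
```

A tool would call this right after writing memory.max, for example set_high_from_max("/sys/fs/cgroup/mygroup", 9), where the cgroup path is again just a placeholder.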

I also traced the OOM problem back to commit 14aa8b2d5c2e ("mm/mglru:
don't sync disk for each aging cycle") in the MGLRU code. So setting
memory.high automatically is one way to avoid premature OOM. That is the
motivation behind this patch.

> I wonder what's special about your case if you do see a lot of OOMs
> that can be avoided by setting memory.high? Do you have a bursty workload?

In our case, the OOM kill can be triggered by writing a large data file
that exceeds memory.max to an NFS-mounted filesystem, as long as there are
enough free pages that the dirty_bytes/dirty_background_bytes mechanism
isn't triggered.

Regards,
Longman