On 1/31/25 07:19, Johannes Weiner wrote:
> On Thu, Jan 30, 2025 at 12:07:31PM -0500, Waiman Long wrote:
>> On 1/30/25 11:39 AM, Johannes Weiner wrote:
>>> On Thu, Jan 30, 2025 at 09:52:29AM -0500, Waiman Long wrote:
>>>> On 1/29/25 3:10 PM, Yosry Ahmed wrote:
>>>>> On Wed, Jan 29, 2025 at 02:12:04PM -0500, Waiman Long wrote:
>>>>>> Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing
>>>>>> reclaim over memory.high"), the amount of allocator throttling has
>>>>>> increased substantially. As a result, it can be difficult for a
>>>>>> misbehaving application that consumes an increasing amount of memory
>>>>>> to be OOM-killed if memory.high is set. Instead, the application may
>>>>>> just crawl along, holding close to the allowed memory.high memory
>>>>>> for the current memory cgroup for a very long time, especially one
>>>>>> that does a lot of memcg charging and uncharging operations.
>>>>>>
>>>>>> This behavior makes the upstream Kubernetes community hesitant to
>>>>>> use memory.high. Instead, they use only memory.max for memory control,
>>>>>> similar to what is being done for cgroup v1 [1].
>>>>>>
>>>>>> To allow better control of the amount of throttling, and hence the
>>>>>> speed at which a misbehaving task can be OOM-killed, a new single-value
>>>>>> memory.high.throttle control file is now added. The allowable range
>>>>>> is 0-32. By default, it has a value of 0, which means maximum throttling
>>>>>> as before. Any non-zero positive value represents the corresponding
>>>>>> power-of-2 reduction of throttling and makes OOM kills easier to happen.
>>>>>>
>>>>>> System administrators can now use this parameter to determine how easily
>>>>>> they want OOM kills to happen for applications that tend to consume
>>>>>> a lot of memory, without the need to run a special userspace memory
>>>>>> management tool to monitor memory consumption when memory.high is set.
>>>>>>
>>>>>> Below are the test results of a simple program showing how different
>>>>>> values of memory.high.throttle affect its run time until it gets
>>>>>> OOM-killed. This test program continuously allocates pages from the
>>>>>> kernel. There are some run-to-run variations and the results are just
>>>>>> one possible set of samples.
>>>>>>
>>>>>>   # systemd-run -p MemoryHigh=10M -p MemoryMax=20M -p MemorySwapMax=10M \
>>>>>>       --wait -t timeout 300 /tmp/mmap-oom
>>>>>>
>>>>>>   memory.high.throttle    service runtime (secs)
>>>>>>   --------------------    ----------------------
>>>>>>             0                   120.521
>>>>>>             1                   103.376
>>>>>>             2                    85.881
>>>>>>             3                    69.698
>>>>>>             4                    42.668
>>>>>>             5                    45.782
>>>>>>             6                    22.179
>>>>>>             7                     9.909
>>>>>>             8                     5.347
>>>>>>             9                     3.100
>>>>>>            10                     1.757
>>>>>>            11                     1.084
>>>>>>            12                     0.919
>>>>>>            13                     0.650
>>>>>>            14                     0.650
>>>>>>            15                     0.655
>>>>>>
>>>>>> [1] https://docs.google.com/document/d/1mY0MTT34P-Eyv5G1t_Pqs4OWyIH-cg9caRKWmqYlSbI/edit?tab=t.0
>>>>>>
>>>>>> Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
>>>>>> ---
>>>>>>  Documentation/admin-guide/cgroup-v2.rst | 16 ++++++++--
>>>>>>  include/linux/memcontrol.h              |  2 ++
>>>>>>  mm/memcontrol.c                         | 41 +++++++++++++++++++++++++
>>>>>>  3 files changed, 57 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
>>>>>> index cb1b4e759b7e..df9410ad8b3b 100644
>>>>>> --- a/Documentation/admin-guide/cgroup-v2.rst
>>>>>> +++ b/Documentation/admin-guide/cgroup-v2.rst
>>>>>> @@ -1291,8 +1291,20 @@ PAGE_SIZE multiple when read back.
>>>>>> 	Going over the high limit never invokes the OOM killer and
>>>>>> 	under extreme conditions the limit may be breached. The high
>>>>>> 	limit should be used in scenarios where an external process
>>>>>> -	monitors the limited cgroup to alleviate heavy reclaim
>>>>>> -	pressure.
>>>>>> +	monitors the limited cgroup to alleviate heavy reclaim pressure
>>>>>> +	unless a high enough value is set in "memory.high.throttle".
>>>>>> +
>>>>>> +  memory.high.throttle
>>>>>> +	A read-write single value file which exists on non-root
>>>>>> +	cgroups. The default is 0.
>>>>>> +
>>>>>> +	Memory usage throttle control. This value controls the amount
>>>>>> +	of throttling that will be applied when memory consumption
>>>>>> +	exceeds the "memory.high" limit. The larger the value is,
>>>>>> +	the smaller the amount of throttling will be and the easier an
>>>>>> +	offending application may get OOM killed.
>>>>>
>>>>> memory.high is supposed to never invoke the OOM killer (see above). It's
>>>>> unclear to me if you are referring to OOM kills from the kernel or
>>>>> userspace in the commit message. If the latter, I think it shouldn't be
>>>>> in kernel docs.
>>>>
>>>> I am sorry for not being clear. What I meant is that if an application
>>>> is consuming more memory than can be recovered by memory reclaim, it
>>>> will reach memory.max faster, if set, and get OOM-killed. I will
>>>> clarify that in the next version.
>>>
>>> You're not really supposed to use max and high in conjunction. One is
>>> for kernel OOM killing, the other for userspace OOM killing. That's
>>> what the documentation that you edited is trying to explain, though.
>>>
>>> What's the use case you have in mind?
>>
>> It is new to me that high and max are not supposed to be used
>> together. One problem with v1 is that by the time the limit is reached
>> and memory reclaim is not able to recover enough memory in time, the
>> task will be OOM-killed. I always thought that by setting high to a bit
>> below max, say 90%, early memory reclaim would reduce the chance of OOM
>> kills. There are certainly others who think like that.
>
> I can't fault you or them for this, because this was the original plan
> for these knobs. However, this didn't end up working in practice.
>
> If you have a non-throttling, non-killing limit, then reclaim will
> either work and keep the workload to that limit; or it won't work, and
> the workload escapes to the hard limit and gets killed.
>
> You'll notice you get the same behavior with just memory.max set by
> itself - either reclaim can keep up, or OOM is triggered.

Yep, that was intentional; it was best effort.

>
>> So the use case here is to reduce the chance of OOM kills without
>> letting really misbehaving tasks hold up useful memory for too long.
>
> That brings us to the idea of a medium amount of throttling.
>
> The premise would be that, by throttling *to a certain degree*, you
> can slow the workload down just enough to tide over the pressure peak
> and avert the OOM kill.
>
> This assumes that some tasks inside the cgroup can independently make
> forward progress and release memory, while allocating tasks inside the
> group are already throttled.
>
> [ Keep in mind, it's a cgroup-internal limit, so no memory freeing
>   outside of the group can alleviate the situation. Progress must
>   happen from within the cgroup. ]
>
> But this sort of parallelism in a pressured cgroup is unlikely in
> practice. By the time reclaim fails, usually *every task* in the
> cgroup ends up having to allocate.
> Because they lose executables to cache reclaim, or heap memory to
> swap, etc., and then page fault.
>
> We found that more often than not, it just deteriorates into a single
> sequence of events. Slowing it down just drags out the inevitable.
>
> As a result we eventually moved away from the idea of gradual
> throttling. The last remnants of this idea finally disappeared from
> the docs last year (commit 5647e53f7856bb39dae781fe26aa65a699e2fc9f).
>
> memory.high now effectively puts the cgroup to sleep when reclaim
> fails (similar to oom killer disabling in v1, but without the caveats
> of that implementation). This is useful to let userspace implement
> custom OOM killing policies.
>

I've found that using memory.high as a limit behaves the way you've
described: with a benchmark like STREAM, the benchmark did not finish
and was stalled for several hours when it was short of a few GBs of
memory, and I did not have a background user space process to do a
user space kill. In my case, reclaim was able to reclaim a little bit,
so some forward progress was made and the nr_retries limit was never
hit (IIRC).

Effectively, in v1 the soft_limit was supposed to be best-effort
pushback, and the OOM killer could find a task to kill globally (in
the initial design) if there was global memory pressure.

For this discussion, adding memory.high.throttle seems like it bridges
the transition from memory.high to memory.max/OOM without external
intervention. I do feel that not killing the task just locks the task
in the memcg forever (at least in my case), and it sounds like using
memory.high requires an external process monitor to kill the task if
it does not make progress.

Warm Regards,
Balbir Singh
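
The mm/memcontrol.c part of the patch is not quoted in this thread, so
the following is only a minimal, hypothetical sketch of the power-of-2
reduction the commit message describes: the delay applied when usage
exceeds memory.high is shifted right by the memory.high.throttle value
(0 keeps today's full penalty, larger values shrink it). The function
and parameter names here are illustrative, not the actual kernel code.

	/*
	 * Hypothetical illustration only -- not the real implementation.
	 * It models the commit message: memory.high.throttle is 0-32,
	 * 0 means maximum throttling, and each increment is a power-of-2
	 * reduction of the delay applied on a memory.high breach.
	 */
	#include <stdio.h>

	#define HIGH_THROTTLE_MAX 32U

	/* Illustrative name; not a real kernel function. */
	static unsigned long long throttled_delay_msec(unsigned long long full_delay_msec,
						       unsigned int high_throttle)
	{
		if (high_throttle > HIGH_THROTTLE_MAX)
			high_throttle = HIGH_THROTTLE_MAX;

		/* Power-of-2 reduction: each step halves the throttling delay. */
		return full_delay_msec >> high_throttle;
	}

	int main(void)
	{
		/* Assume a 2000 ms worst-case sleep per over-high allocation burst. */
		for (unsigned int t = 0; t <= 15; t++)
			printf("memory.high.throttle=%2u -> delay %llu ms\n",
			       t, throttled_delay_msec(2000, t));
		return 0;
	}

Under that reading, a rapidly shrinking sleep is what lets the test
program blow past memory.high and reach memory.max (and the OOM kill)
sooner at higher throttle values, consistent with the runtime table
quoted above.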
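The /tmp/mmap-oom test binary itself is also not included in the
thread; a sketch along the lines the commit message describes
(continuously allocating and touching anonymous pages) might look like
this. It is an assumption about the test, not the actual program.

	/*
	 * Hedged sketch of a test allocator in the spirit of the
	 * /tmp/mmap-oom binary referenced above. It keeps mapping and
	 * touching anonymous memory so the cgroup's usage climbs past
	 * memory.high toward memory.max.
	 */
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		const size_t chunk = 1UL << 20;	/* 1 MiB per iteration */

		for (;;) {
			char *p = mmap(NULL, chunk, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (p == MAP_FAILED) {
				perror("mmap");
				return 1;
			}
			/* Touch every page so the memory is actually charged. */
			memset(p, 0xa5, chunk);
		}
	}

Run under the quoted systemd-run command (MemoryHigh=10M,
MemoryMax=20M, MemorySwapMax=10M), larger memory.high.throttle values
should shorten the measured service runtime, as in the table above.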