On 1/31/25 07:19, Johannes Weiner wrote:
> On Thu, Jan 30, 2025 at 12:07:31PM -0500, Waiman Long wrote:
>> On 1/30/25 11:39 AM, Johannes Weiner wrote:
>>> On Thu, Jan 30, 2025 at 09:52:29AM -0500, Waiman Long wrote:
>>>> On 1/29/25 3:10 PM, Yosry Ahmed wrote:
>>>>> On Wed, Jan 29, 2025 at 02:12:04PM -0500, Waiman Long wrote:
>>>>>> Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing
>>>>>> reclaim over memory.high"), the amount of allocator throttling has
>>>>>> increased substantially. As a result, it can be difficult for a
>>>>>> misbehaving application that consumes an increasing amount of memory
>>>>>> to be OOM-killed if memory.high is set. Instead, the application may
>>>>>> just crawl along, holding close to the allowed memory.high memory
>>>>>> for the current memory cgroup for a very long time, especially one
>>>>>> that does a lot of memcg charging and uncharging operations.
>>>>>>
>>>>>> This behavior makes the upstream Kubernetes community hesitant to
>>>>>> use memory.high. Instead, they use only memory.max for memory control,
>>>>>> similar to what is being done for cgroup v1 [1].
>>>>>>
>>>>>> To allow better control of the amount of throttling, and hence the
>>>>>> speed at which a misbehaving task can be OOM-killed, a new single-value
>>>>>> memory.high.throttle control file is now added. The allowable range
>>>>>> is 0-32. By default, it has a value of 0, which means maximum throttling
>>>>>> as before. Any non-zero positive value represents the corresponding
>>>>>> power-of-2 reduction of throttling and makes OOM kills easier to happen.
>>>>>>
>>>>>> System administrators can now use this parameter to determine how easily
>>>>>> they want OOM kills to happen for applications that tend to consume
>>>>>> a lot of memory, without the need to run a special userspace memory
>>>>>> management tool to monitor memory consumption when memory.high is set.
>>>>>>
>>>>>> Below are the test results of a simple program showing how different
>>>>>> values of memory.high.throttle affect its run time until it gets
>>>>>> OOM-killed. This test program continuously allocates pages from the
>>>>>> kernel. There are some run-to-run variations and the results are just
>>>>>> one possible set of samples.
>>>>>>
>>>>>>   # systemd-run -p MemoryHigh=10M -p MemoryMax=20M -p MemorySwapMax=10M \
>>>>>>       --wait -t timeout 300 /tmp/mmap-oom
>>>>>>
>>>>>>   memory.high.throttle    service runtime (secs)
>>>>>>   --------------------    ----------------------
>>>>>>             0                   120.521
>>>>>>             1                   103.376
>>>>>>             2                    85.881
>>>>>>             3                    69.698
>>>>>>             4                    42.668
>>>>>>             5                    45.782
>>>>>>             6                    22.179
>>>>>>             7                     9.909
>>>>>>             8                     5.347
>>>>>>             9                     3.100
>>>>>>            10                     1.757
>>>>>>            11                     1.084
>>>>>>            12                     0.919
>>>>>>            13                     0.650
>>>>>>            14                     0.650
>>>>>>            15                     0.655
>>>>>>
>>>>>> [1] https://docs.google.com/document/d/1mY0MTT34P-Eyv5G1t_Pqs4OWyIH-cg9caRKWmqYlSbI/edit?tab=t.0
>>>>>>
>>>>>> Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
>>>>>> ---
>>>>>>  Documentation/admin-guide/cgroup-v2.rst | 16 ++++++++--
>>>>>>  include/linux/memcontrol.h              |  2 ++
>>>>>>  mm/memcontrol.c                         | 41 +++++++++++++++++++++++++
>>>>>>  3 files changed, 57 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
>>>>>> index cb1b4e759b7e..df9410ad8b3b 100644
>>>>>> --- a/Documentation/admin-guide/cgroup-v2.rst
>>>>>> +++ b/Documentation/admin-guide/cgroup-v2.rst
>>>>>> @@ -1291,8 +1291,20 @@ PAGE_SIZE multiple when read back.
>>>>>> 	Going over the high limit never invokes the OOM killer and
>>>>>> 	under extreme conditions the limit may be breached. The high
>>>>>> 	limit should be used in scenarios where an external process
>>>>>> -	monitors the limited cgroup to alleviate heavy reclaim
>>>>>> -	pressure.
>>>>>> +	monitors the limited cgroup to alleviate heavy reclaim pressure
>>>>>> +	unless a high enough value is set in "memory.high.throttle".
>>>>>> +
>>>>>> +  memory.high.throttle
>>>>>> +	A read-write single value file which exists on non-root
>>>>>> +	cgroups. The default is 0.
>>>>>> +
>>>>>> +	Memory usage throttle control. This value controls the amount
>>>>>> +	of throttling that will be applied when memory consumption
>>>>>> +	exceeds the "memory.high" limit. The larger the value is,
>>>>>> +	the smaller the amount of throttling will be and the easier an
>>>>>> +	offending application may get OOM killed.
>>>>>
>>>>> memory.high is supposed to never invoke the OOM killer (see above). It's
>>>>> unclear to me if you are referring to OOM kills from the kernel or
>>>>> userspace in the commit message. If the latter, I think it shouldn't be
>>>>> in kernel docs.
>>>>
>>>> I am sorry for not being clear. What I meant is that if an application
>>>> is consuming more memory than can be recovered by memory reclaim, it
>>>> will reach memory.max faster, if set, and get OOM-killed. I will
>>>> clarify that in the next version.
>>>
>>> You're not really supposed to use max and high in conjunction. One is
>>> for kernel OOM killing, the other for userspace OOM killing. That's
>>> what the documentation that you edited is trying to explain, though.
>>>
>>> What's the use case you have in mind?
>>
>> It is new to me that high and max are not supposed to be used
>> together. One problem with v1 is that by the time the limit is reached
>> and memory reclaim is not able to recover enough memory in time, the
>> task will be OOM-killed. I always thought that by setting high to a bit
>> below max, say 90%, early memory reclaim would reduce the chance of OOM
>> kills. There are certainly others who think like that.
>
> I can't fault you or them for this, because this was the original plan
> for these knobs. However, this didn't end up working in practice.
>
> If you have a non-throttling, non-killing limit, then reclaim will
> either work and keep the workload to that limit; or it won't work, and
> the workload escapes to the hard limit and gets killed.
>
> You'll notice you get the same behavior with just memory.max set by
> itself - either reclaim can keep up, or OOM is triggered.

Yep, that was intentional; it was best effort.

>
>> So the use case here is to reduce the chance of OOM kills without
>> letting really misbehaving tasks hold up useful memory for too long.
>
> That brings us to the idea of a medium amount of throttling.
>
> The premise would be that, by throttling *to a certain degree*, you
> can slow the workload down just enough to tide over the pressure peak
> and avert the OOM kill.
>
> This assumes that some tasks inside the cgroup can independently make
> forward progress and release memory, while allocating tasks inside the
> group are already throttled.
>
> [ Keep in mind, it's a cgroup-internal limit, so no memory freeing
>   outside of the group can alleviate the situation. Progress must
>   happen from within the cgroup. ]
>
> But this sort of parallelism in a pressured cgroup is unlikely in
> practice. By the time reclaim fails, usually *every task* in the
> cgroup ends up having to allocate.
> Because they lose executables to cache reclaim, or heap memory to
> swap, etc., and then page fault.
>
> We found that more often than not, it just deteriorates into a single
> sequence of events. Slowing it down just drags out the inevitable.
>
> As a result we eventually moved away from the idea of gradual
> throttling. The last remnants of this idea finally disappeared from
> the docs last year (commit 5647e53f7856bb39dae781fe26aa65a699e2fc9f).
>
> memory.high now effectively puts the cgroup to sleep when reclaim
> fails (similar to oom killer disabling in v1, but without the caveats
> of that implementation). This is useful to let userspace implement
> custom OOM killing policies.
>

I've found that using memory.high as a limit behaves the way you've
described: with a benchmark like STREAM, the benchmark did not finish
and was stalled for several hours when it was short of a few GBs of
memory, and I did not have a background user space process to do a
user space kill. In my case, reclaim was able to reclaim a little bit,
so some forward progress was made and the nr_retries limit was never
hit (IIRC).

Effectively, in v1 the soft_limit was supposed to be best-effort
pushback, and the OOM killer could find a task to kill globally (in
the initial design) if there was global memory pressure.

For this discussion, adding memory.high.throttle seems like it bridges
the transition from memory.high to memory.max/OOM without external
intervention. I do feel that not killing the task just locks the task
in the memcg forever (at least in my case), and it sounds like using
memory.high requires an external process monitor to kill the task if
it does not make progress.

Warm Regards,
Balbir Singh
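
The mm/memcontrol.c part of the patch is not quoted in this thread, so
the following is only a minimal, hypothetical sketch of the power-of-2
reduction the commit message describes: the delay applied when usage
exceeds memory.high is shifted right by the memory.high.throttle value
(0 keeps today's full penalty, larger values shrink it). The function
and parameter names here are illustrative, not the actual kernel code.

	/*
	 * Hypothetical illustration only -- not the real implementation.
	 * It models the commit message: memory.high.throttle is 0-32,
	 * 0 means maximum throttling, and each increment is a power-of-2
	 * reduction of the delay applied on a memory.high breach.
	 */
	#include <stdio.h>

	#define HIGH_THROTTLE_MAX 32U

	/* Illustrative name; not a real kernel function. */
	static unsigned long long throttled_delay_msec(unsigned long long full_delay_msec,
						       unsigned int high_throttle)
	{
		if (high_throttle > HIGH_THROTTLE_MAX)
			high_throttle = HIGH_THROTTLE_MAX;

		/* Power-of-2 reduction: each step halves the throttling delay. */
		return full_delay_msec >> high_throttle;
	}

	int main(void)
	{
		/* Assume a 2000 ms worst-case sleep per over-high allocation burst. */
		for (unsigned int t = 0; t <= 15; t++)
			printf("memory.high.throttle=%2u -> delay %llu ms\n",
			       t, throttled_delay_msec(2000, t));
		return 0;
	}

Under that reading, a rapidly shrinking sleep is what lets the test
program blow past memory.high and reach memory.max (and the OOM kill)
sooner at higher throttle values, consistent with the runtime table
quoted above.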
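The /tmp/mmap-oom test binary itself is also not included in the
thread; a sketch along the lines the commit message describes
(continuously allocating and touching anonymous pages) might look like
this. It is an assumption about the test, not the actual program.

	/*
	 * Hedged sketch of a test allocator in the spirit of the
	 * /tmp/mmap-oom binary referenced above. It keeps mapping and
	 * touching anonymous memory so the cgroup's usage climbs past
	 * memory.high toward memory.max.
	 */
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		const size_t chunk = 1UL << 20;	/* 1 MiB per iteration */

		for (;;) {
			char *p = mmap(NULL, chunk, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (p == MAP_FAILED) {
				perror("mmap");
				return 1;
			}
			/* Touch every page so the memory is actually charged. */
			memset(p, 0xa5, chunk);
		}
	}

Run under the quoted systemd-run command (MemoryHigh=10M,
MemoryMax=20M, MemorySwapMax=10M), larger memory.high.throttle values
should shorten the measured service runtime, as in the table above.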