Re: [RFC PATCH] mm, memcg: introduce memory.high.throttle

Shakeel Butt <shakeel.butt@xxxxxxxxx> · Thu, 30 Jan 2025 09:32:15 -0800

On Thu, Jan 30, 2025 at 12:19:38PM -0500, Waiman Long wrote:
> On 1/30/25 12:05 PM, Roman Gushchin wrote:
> > On Thu, Jan 30, 2025 at 10:05:34AM -0500, Waiman Long wrote:
> > > On 1/30/25 3:15 AM, Michal Hocko wrote:
> > > > On Wed 29-01-25 14:12:04, Waiman Long wrote:
> > > > > Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing
> > > > > reclaim over memory.high"), the amount of allocator throttling has
> > > > > increased substantially. As a result, a misbehaving application that
> > > > > consumes an increasing amount of memory may be difficult to OOM-kill
> > > > > if memory.high is set. Instead, the application may just crawl
> > > > > along, holding close to the memory.high limit of its memory cgroup
> > > > > for a very long time, especially if it does a lot of memcg charging
> > > > > and uncharging operations.
> > > > > 
> > > > > This behavior makes the upstream Kubernetes community hesitate to
> > > > > use memory.high. Instead, they use only memory.max for memory control
> > > > > similar to what is being done for cgroup v1 [1].
> > > > Why is this a problem for them?
> > > My understanding is that a misbehaving container will tie up a memory.high
> > > amount of memory for a long time instead of getting OOM killed sooner, so
> > > that the memory can be used more productively elsewhere.
> > > > > To allow better control of the amount of throttling and hence the
> > > > > speed at which a misbehaving task can be OOM killed, a new single-value
> > > > > memory.high.throttle control file is now added. The allowable range
> > > > > is 0-32.  By default, it has a value of 0 which means maximum throttling
> > > > > like before. Any non-zero positive value represents the corresponding
> > > > > power of 2 reduction of throttling and makes OOM kills more likely.
> > > > I do not like the interface to be honest. It exposes an implementation
> > > > detail and casts it into a user API. If we ever need to change the way
> > > > the throttling is implemented this will stand in the way because
> > > > there will be applications depending on a behavior they were carefully
> > > > tuned to.
> > > > 
> > > > It is also not entirely clear how this is supposed to be used in
> > > > practice. How do people know what kind of value they should use?
> > > Yes, I agree that a user may need to run some trial runs to find a proper
> > > value. Perhaps a simpler binary interface of "off" and "on" may be easier to
> > > understand and use.
> > > > > System administrators can now use this parameter to determine how easily
> > > > > they want OOM kills to happen for applications that tend to consume
> > > > > a lot of memory without the need to run a special userspace memory
> > > > > management tool to monitor memory consumption when memory.high is set.
> > > > Why cannot they achieve the same with the existing events/metrics we
> > > > already do provide? Most notably PSI which is properly accounted when
> > > > a task is throttled due to memory.high throttling.
> > > That will require the use of a userspace management agent that looks for
> > > these stalling conditions and makes the kill, if necessary. There are
> > > certainly users out there that want to get some benefit of using memory.high
> > > like early memory reclaim without the trouble of handling these kinds of
> > > stalling conditions.
> > So you basically want to force the workload into some sort of a proactive
> > reclaim but without an artificial slowdown?

I wouldn't call it proactive reclaim, as the reclaim will happen
synchronously in the allocating thread.

> > It makes some sense to me, but
> > 1) I don't know if it deserves a new API, because it can be implemented
> >    relatively easily in userspace by a daemon which monitors cgroup usage and
> >    reclaims the memory if necessary. No kernel changes are needed.
> > 2) If a new API is introduced, I think it's better to introduce a new limit,
> >    e.g. memory.target, keeping memory.high semantics intact.
> 
> Yes, you are right about that. Introducing a new "memory.target" without
> disturbing the existing "memory.high" semantics will work for me too.
> 

So, what happens if reclaim cannot reduce usage below memory.target?
Infinite reclaim cycles or just give up?
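
To make the question concrete, here is a rough userspace sketch of the
daemon approach Roman describes in (1), using one possible policy: give
up after a bounded number of attempts. It assumes cgroup v2 with the
existing memory.current and memory.reclaim files. The function names,
the target value and the retry count are made up for illustration, and
nothing here is meant to say how a kernel-side memory.target would or
should behave.

```python
import errno
from pathlib import Path


def read_usage(cgroup: Path) -> int:
    """Return the cgroup's current memory usage in bytes (memory.current)."""
    return int((cgroup / "memory.current").read_text().strip())


def reclaim_toward_target(cgroup: Path, target_bytes: int,
                          max_attempts: int = 3) -> bool:
    """Try to bring usage down to target_bytes via memory.reclaim.

    Hypothetical policy: ask the kernel to reclaim the excess, retry a
    bounded number of times, then give up. Returns True if usage ended
    up at or below the target, False otherwise.
    """
    for _ in range(max_attempts):
        excess = read_usage(cgroup) - target_bytes
        if excess <= 0:
            return True
        try:
            # Writing a byte count asks the kernel to reclaim that much.
            (cgroup / "memory.reclaim").write_text(str(excess))
        except OSError as err:
            # EAGAIN means less than the requested amount was reclaimed;
            # treat it as a failed attempt and re-check usage.
            if err.errno != errno.EAGAIN:
                raise
    return read_usage(cgroup) <= target_bytes
```

A caller would pass something like Path("/sys/fs/cgroup/<group>") and a
byte count, and decide for itself what to do on a False return (kill,
alert, or retry later). That decision is exactly the part the question
above is asking about.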

> Cheers,
> Longman
>