On Tue, Jul 14, 2020 at 1:41 AM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>
> On Fri 10-07-20 12:19:37, Shakeel Butt wrote:
> > On Fri, Jul 10, 2020 at 11:42 AM Roman Gushchin <guro@xxxxxx> wrote:
> > >
> > > On Fri, Jul 10, 2020 at 07:12:22AM -0700, Shakeel Butt wrote:
> > > > On Fri, Jul 10, 2020 at 5:29 AM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Thu 09-07-20 12:47:18, Roman Gushchin wrote:
> > > > > > The memory.high limit is implemented in a way such that the kernel
> > > > > > penalizes all threads which are allocating memory over the limit.
> > > > > > Forcing all threads into synchronous reclaim and adding some
> > > > > > artificial delays slows down the memory consumption and potentially
> > > > > > gives userspace oom handlers/resource control agents some time to
> > > > > > react.
> > > > > >
> > > > > > It works nicely if the memory usage is hitting the limit from below,
> > > > > > however it works sub-optimally if a user adjusts memory.high to a
> > > > > > value way below the current memory usage. It basically forces all
> > > > > > workload threads (doing any memory allocations) into synchronous
> > > > > > reclaim and sleep. This makes the workload completely unresponsive
> > > > > > for a long period of time and can also lead to system-wide
> > > > > > contention on lru locks. It can happen even if the workload is not
> > > > > > actually tight on memory and has, for example, a ton of cold
> > > > > > pagecache.
> > > > > >
> > > > > > In the current implementation, writing to memory.high causes an
> > > > > > atomic update of the page counter's high value followed by an
> > > > > > attempt to reclaim enough memory to fit into the new limit. To fix
> > > > > > the problem described above, all we need is to change the order of
> > > > > > execution: try to push the memory usage under the limit first, and
> > > > > > only then set the new high limit.
> > > > >
> > > > > Shakeel, would this help with your pro-active reclaim usecase? It
> > > > > would require resetting the high limit right after the reclaim
> > > > > returns, which is quite ugly, but it would at least not require a
> > > > > completely new interface. You would simply do
> > > > >       high = current - to_reclaim
> > > > >       echo $high > memory.high
> > > > >       echo infinity > memory.high # To prevent direct reclaim
> > > > >                                   # allocation stalls
> > > > >
> > > >
> > > > This will reduce the chance of stalls, but the interface is still
> > > > non-delegatable, i.e. applications cannot change their own memory.high
> > > > for use cases like application-controlled proactive reclaim and uswapd.
> > >
> > > Can you, please, elaborate a bit more on this? I didn't understand
> > > why.
> > >
> >
> > Sure. Do we want memory.high to be a CFTYPE_NS_DELEGATABLE file? I
> > don't think so, otherwise any job on a system could change its own
> > memory.high and adversely impact the isolation and memory scheduling
> > of the system.
>
> Is this really the case? There should always be a parent cgroup that
> overrides the setting.

Can you explain a bit more? I don't see any requirement for having a
layer of cgroup between the root and the job cgroup. Internally we
schedule jobs as top-level cgroups. There do exist jobs which are a
combination of other jobs, and there we do use an additional layer of
cgroup (similar to pods running multiple containers in Kubernetes).
Surely we could add a layer for all jobs, but it comes with an
overhead, and at scale that overhead is not negligible.
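
To make the delegation point concrete, this is roughly what an in-job
proactive reclaimer would have to do with the sequence suggested above
(the cgroup path and reclaim amount are made up for illustration); the
job needs write access to its own memory.high for this, which is
exactly the non-delegatable part:

  CG=/sys/fs/cgroup/job                    # hypothetical job cgroup
  to_reclaim=$((256 << 20))                # e.g. try to reclaim 256M

  current=$(cat "$CG/memory.current")
  echo $((current - to_reclaim)) > "$CG/memory.high"
  echo max > "$CG/memory.high"             # "max" is cgroup v2's spelling
                                           # of the "infinity" above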
> Also you can always set the hard limit if you do
> not want to add another layer of cgroup in the hierarchy before
> delegation. Or am I missing something?
>

Yes, we can set memory.max, though it has different oom semantics and
is not really a replacement for memory.high.
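
To spell out what I mean by different oom semantics (a rough
illustration; $CG stands for the job's cgroup directory):

  echo 1G > "$CG/memory.max"   # hard limit: if reclaim cannot keep usage
                               # below it, the memcg OOM killer is invoked
  echo 1G > "$CG/memory.high"  # allocating tasks are throttled and forced
                               # into reclaim, but nothing is OOM-killed

For the proactive reclaim / uswapd use case that difference matters: an
off-target write to memory.max can trigger an oom kill, while
memory.high cannot.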