On Tue 14-09-21 13:10:04, Vasily Averin wrote:
> The kernel currently allows dying tasks to exceed the memcg limits.
> The allocation is expected to be the last one and the occupied memory
> will be freed soon.
>
> This is not always true because it can be part of the huge vmalloc
> allocation. Allowed once, they will repeat over and over again.
> Moreover lifetime of the allocated object can differ from the lifetime
> of the dying task.
> Multiple such allocations running concurrently can not only overuse
> the memcg limit, but can lead to a global out of memory and,
> in the worst case, cause the host to panic.
>
> This patch removes checks forced exceed of the memcg limit for dying
> tasks. Also it breaks endless loop for tasks bypassed by the oom killer.
> In addition, it renames should_force_charge() helper to task_is_dying()
> because now its use do not lead to the forced charge.

I would rephrase the changelog as follows to give a broader picture.
"
Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit. It is assumed that the amount of memory charged by those tasks is
bounded and that most of the memory will get released while the task is
exiting. This resembles the heuristic for the global OOM situation, where
tasks get access to memory reserves. There is no global memory shortage
at the memcg level, so the memcg heuristic is more relaxed.

The above assumption is overly optimistic though. E.g. vmalloc can scale
to really large requests and the heuristic would allow that. We used to
have an early break in the vmalloc allocator for killed tasks but this
has been reverted by b8c8a338f75e (Revert "vmalloc: back off when the
current task is killed"). There are likely other similar code paths
which do not check for fatal signals in an allocation&charge loop.
Also there are some kernel objects charged to a memcg which are not
bound to a process lifetime.
It has been observed that it is not really hard to trigger these
bypasses and cause a global OOM situation.

One potential way to address these runaways would be to limit the amount
of excess (similar to the global OOM with limited oom reserves). This is
certainly possible but it is not really clear how much of an excess is
desirable and still protects from global OOMs, as that would have to
consider the overall memcg configuration.

This patch addresses the problem by removing the heuristic altogether.
Bypass is only allowed for requests which either cannot fail or where
the failure is not desirable while excess should still be limited (e.g.
atomic requests). Implementation wise, a killed or dying task fails to
charge if it has passed the OOM killer stage. That should give all forms
of reclaim a chance to restore the limit before the failure (ENOMEM) and
tell the caller to back off.
"
Feel free to use parts or the whole of it.

> Suggested-by: Michal Hocko <mhocko@xxxxxxxx>
> Signed-off-by: Vasily Averin <vvs@xxxxxxxxxxxxx>
> ---
>  mm/memcontrol.c | 27 ++++++++-------------------
>  1 file changed, 8 insertions(+), 19 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 389b5766e74f..707f6640edda 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -234,7 +234,7 @@ enum res_type {
>  	     iter != NULL;				\
>  	     iter = mem_cgroup_iter(NULL, iter, NULL))
>
> -static inline bool should_force_charge(void)
> +static inline bool task_is_dying(void)
>  {
>  	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
>  		(current->flags & PF_EXITING);
> @@ -1607,7 +1607,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * A few threads which were not waiting at mutex_lock_killable() can
>  	 * fail to bail out. Therefore, check again after holding oom_lock.
>  	 */
> -	ret = should_force_charge() || out_of_memory(&oc);
> +	ret = task_is_dying() || out_of_memory(&oc);

The task_is_dying() check will prevent the oom killer from running for
dying tasks.
There is an additional bail out at the out_of_memory layer. These checks
now lead to completely different behavior. Currently we simply use
"unlimited" reserves and therefore we do not have to kill any task. Now
the charge fails without using all reclaim measures. So I believe we
should drop those checks for memcg oom paths. I have to think about this
some more because I might be missing some other side effects.

-- 
Michal Hocko
SUSE Labs