On Thu 09-01-14 13:40:10, David Rientjes wrote:
> On Thu, 9 Jan 2014, Michal Hocko wrote:
>
> > Eric has reported that he can see task(s) stuck in memcg OOM handler
> > regularly. The only way out is to
> >         echo 0 > $GROUP/memory.oom_control
> > His usecase is:
> > - Setup a hierarchy with memory and the freezer
> >   (disable kernel oom and have a process watch for oom).
> > - In that memory cgroup add a process with one thread per cpu.
> > - In one thread slowly allocate once per second I think it is 16M of ram
> >   and mlock and dirty it (just to force the pages into ram and stay there).
> > - When oom is achieved loop:
> >   * attempt to freeze all of the tasks.
> >   * if frozen send every task SIGKILL, unfreeze, remove the directory in
> >     cgroupfs.
> >
> > Eric has then pinpointed the issue to be memcg specific.
> >
> > All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled.
> > Those that have received fatal signal will bypass the charge and should
> > continue on their way out. The tricky part is that the exit path might
> > trigger a page fault (e.g. exit_robust_list), thus the memcg charge,
> > while its memcg is still under OOM because nobody has released any
> > charges yet.
> > Unlike with the in-kernel OOM handler the exiting task doesn't get
> > TIF_MEMDIE set so it doesn't shortcut further charges of the killed task
> > and falls to the memcg OOM again without any way out of it as there are
> > no fatal signals pending anymore.
> >
> > This patch fixes the issue by checking PF_EXITING early in
> > __mem_cgroup_try_charge and bypass the charge same as if it had fatal
> > signal pending or TIF_MEMDIE set.
> >
> > Normally exiting tasks (aka not killed) will bypass the charge now but
> > this should be OK as the task is leaving and will release memory and
> > increasing the memory pressure just to release it in a moment seems
> > dubious wasting of cycles. Besides that charges after exit_signals
> > should be rare.
> >
> > Reported-by: Eric W. Biederman <ebiederm@xxxxxxxxxxxx>
> > Signed-off-by: Michal Hocko <mhocko@xxxxxxx>
>
> Is this tested? By Eric?

Not AFAIK. I wasn't able to reproduce the issue myself.

> > ---
> >  mm/memcontrol.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index b8dfed1b9d87..b86fbb04b7c6 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> >           * MEMDIE process.
> >           */
> >          if (unlikely(test_thread_flag(TIF_MEMDIE)
> > -                     || fatal_signal_pending(current)))
> > +                     || fatal_signal_pending(current))
> > +                     || current->flags & PF_EXITING)
> >                  goto bypass;
> >
> >          if (unlikely(task_in_memcg_oom(current)))
>
> This would become problematic if significant amount of memory is charged
> in the exit() path.

But this would also hurt for fatal_signal_pending tasks, wouldn't it?
Besides that, I do not see any source of allocation after exit_signals.

> I don't know of an egregious amount of memory being
> allocated and charged after PF_EXITING is set, but if it happens in the
> future then this could potentially cause system oom conditions even in
> memcg configurations

Even if that happens, the global OOM killer would give the exiting
task access to memory reserves and wouldn't kill anything else.

So I am not sure what problem you see exactly.

Besides that, allocating an egregious amount of memory after exit_signals
sounds fundamentally broken to me.

> that are designed such as the one Tejun suggested to
> be able to handle such conditions in userspace:
>
>         ___root___
>        /          \
>      user         oom
>     /    \       /   \
>    A      B     C     D
>
> where the limit of user is equal to the amount of system memory minus
> whatever amount of memory is needed by the system oom handler attached as
> a descendant of oom and still allows the limits of A + B to exceed the
> limit of user.
>
> So how do we ensure that memory allocations in the exit() path don't cause
> system oom conditions whereas the above configuration no longer provides
> any strict guarantee?
>
> Thanks.

--
Michal Hocko
SUSE Labs