It seems that this one fell though the cracks? On Thu 01-08-13 11:06:20, Michal Hocko wrote: > On Wed 31-07-13 15:09:16, Eric W. Biederman wrote: > > Michal Hocko <mhocko@xxxxxxx> writes: > > > > > [I am CCing David here as well] > > > > > > On Tue 30-07-13 09:37:46, Eric W. Biederman wrote: > > >> Michal Hocko <mhocko@xxxxxxx> writes: > > >> > > >> > On Tue 30-07-13 01:19:31, Eric W. Biederman wrote: > > >> > [...] > > >> >> Hmm. Looking farther I see what is going on. And it has nothing to do > > >> >> with the freezer. (I have commented out that code and reproduced it > > >> >> without the freezer to be doubly certain). > > >> >> > > >> >> > > >> >> On the exit path exit_robust_list is triggering a page fault to fault a > > >> >> page back in. Which since we have no memory causes the exit path > > >> >> to get stuck in mem_cgroup_handle_oom. > > >> > > > >> > Hmm, interesting. I assume the exit is caused by the SIGKILL, right? > > >> > If yes, then why it hasn't coughed early in __mem_cgroup_try_charge > > >> > > >> Interesting question. This isn't the primary thread but we do send > > >> SIGKILL to the secondary threads as well. > > >> > > >> We definitely need those checks on both paths making my change valid. > > >> > > >> Oh. Duh! This is after we act on SIGKILL so SIGKILL is no longer > > >> pending. > > > > > > Very well spotted Eric! What do you think about the following patch? > > > I would have to check since when the exit path could trigger the fault > > > but I guess this is worth stable backport. > > > > It doesn't have a prayer of working. > > So it hasn't passed your test? > > > You leave open the race of a fatal signal being received before we go to > > sleep. > > If a fatal signal is received before we're going to sleep then > schedule() should keep it on the runqueue, no? > > static void __sched __schedule(void) > { > [...] > if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) { > if (unlikely(signal_pending_state(prev->state, prev))) { > prev->state = TASK_RUNNING; > } else { > > so it should get a timeslice eventually, mem_cgroup_handle_oom sees > fatal_signal_pending and sets TIF_MEMDIE, bypass the charge, get to > signal handling, start exiting, fault in, get to charge and bail out in > __mem_cgroup_try_charge because it sees TIF_MEMDIE. > > Or what am I missing? > > > You don't handle a task that has processed the fatal signal and is in > > PF_EXITING. Which is what I experienced. > > > > From earlier comments about my code not being early enough I thought I > > was going to see a patch in __mem_cgroup_try_change so that the bypass > > case will kick in also for tasks in PF_EXITING. > > This shouldn't be necessary because TIF_MEMDIE was set for the killed > task. I was playing with PF_EXITING there as well but TIF_MEMDIE sounds > like a more appropriate solution. > > > You change actually addresses things later in the code path than mine > > does. > > > > I do like your summary of the problem. > > > > Eric > > > > > --- > > > From 411408558f2858328ea25e69567e9a53a8314032 Mon Sep 17 00:00:00 2001 > > > From: Michal Hocko <mhocko@xxxxxxx> > > > Date: Wed, 31 Jul 2013 08:48:54 +0200 > > > Subject: [PATCH] memcg: Do not hang on OOM when killed by userspace OOM > > > > > > Eric has reported that he can see task(s) stuck in memcg OOM handler > > > regularly. The only way out is to > > > echo 0 > $GROUP/memory.oom_controll > > > > > > His usecase is: > > > - Setup a hierarchy with memory and the freezer > > > (disable kernel oom and have a process watch for oom). > > > - In that memory cgroup add a process with one thread per cpu. > > > - In one thread slowly allocate once per second I think it is 16M of ram > > > and mlock and dirty it (just to force the pages into ram and stay there). > > > - When oom is achieved loop: > > > * attempt to freeze all of the tasks. > > > * if frozen send every task SIGKILL, unfreeze, remove the directory in > > > cgroupfs. > > > > > > Eric has then pinpointed the issue to be memcg specific. > > > > > > All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled. > > > Those that have received fatal signal will bypass the charge and should > > > continue on their way out. The tricky part is that that exit path might > > > trigger a page fault (e.g. exit_robust_list) thus the memcg charge > > > while its memcg is still under OOM because nobody has released any > > > charges. Unlike with the in-kernel OOM handler the exiting task doesn't > > > get TIF_MEMDIE set so it doesn't shortcut charges and falls to the > > > memcg OOM again without any way out of it as there are no fatal signals > > > pending anymore. > > > > > > This patch sets the TIF_MEMDIE flag pro actively in mem_cgroup_handle_oom > > > if the memcg is disabled after the task is woken up with fatal signal > > > pending. This means that any further charges will be bypassed early in > > > __mem_cgroup_try_charge and the task will have chance to exit finally. > > > > > > Strictly speaking we might mark also a task which hasn't been killed by > > > userspace OOM handler but this is not harmful as the task is going away > > > anyway and under-oom group would like to see it go as soon as possible. > > > > > > Reported-by: Eric W. Biederman <ebiederm@xxxxxxxxxxxx> > > > Debugged-by: Eric W. Biederman <ebiederm@xxxxxxxxxxxx> > > > Signed-off-by: Michal Hocko <mhocko@xxxxxxx> > > > --- > > > mm/memcontrol.c | 13 ++++++++++++- > > > 1 file changed, 12 insertions(+), 1 deletion(-) > > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > index d12ca6f..d4103b0 100644 > > > --- a/mm/memcontrol.c > > > +++ b/mm/memcontrol.c > > > @@ -2235,8 +2235,19 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask, > > > > > > mem_cgroup_unmark_under_oom(memcg); > > > > > > - if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current)) > > > + if (test_thread_flag(TIF_MEMDIE)) > > > return false; > > > + > > > + /* > > > + * Userspace OOM killer might have killed this task but > > > + * there is no way it could have set TIF_MEMDIE as well > > > + * so we have to set it manually. > > > + */ > > > + if (fatal_signal_pending(current)) { > > > + if (memcg->oom_kill_disable) > > > + set_thread_flag(TIF_MEMDIE); > > > + return false; > > > + } > > > /* Give chance to dying process */ > > > schedule_timeout_uninterruptible(1); > > > return true; > > > -- > > > 1.8.3.2 > > -- > > To unsubscribe from this list: send the line "unsubscribe cgroups" in > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > Michal Hocko > SUSE Labs > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html