On Tue, Feb 9, 2010 at 9:32 AM, KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
> On Sat, 6 Feb 2010 01:30:49 +0900
> Minchan Kim <minchan.kim@xxxxxxxxx> wrote:
>
>> Hi, Kame.
>>
>> On Fri, Feb 5, 2010 at 9:39 AM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
>> > Please take this patch in a different context from the recent discussion.
>> > This is a quick fix for a terrible bug.
>> >
>> > The patch itself is against mmotm but can be easily applied to mainline or
>> > a stable tree, I think. (But I won't CC the stable tree until I get an ack.)
>> >
>> > ==
>> > Now, the oom-killer kills a process's children first. But this means
>> > a child in another cgroup can be killed, and that is not checked now.
>> >
>> > This patch fixes that.
>> >
>> > CC: Balbir Singh <balbir@xxxxxxxxxxxxxxxxxx>
>> > CC: Daisuke Nishimura <nishimura@xxxxxxxxxxxxxxxxx>
>> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
>> > ---
>> >  mm/oom_kill.c |    3 +++
>> >  1 file changed, 3 insertions(+)
>> >
>> > Index: mmotm-2.6.33-Feb03/mm/oom_kill.c
>> > ===================================================================
>> > --- mmotm-2.6.33-Feb03.orig/mm/oom_kill.c
>> > +++ mmotm-2.6.33-Feb03/mm/oom_kill.c
>> > @@ -459,6 +459,9 @@ static int oom_kill_process(struct task_
>> >         list_for_each_entry(c, &p->children, sibling) {
>> >                 if (c->mm == p->mm)
>> >                         continue;
>> > +               /* Children may be in another cgroup */
>> > +               if (mem && !task_in_mem_cgroup(c, mem))
>> > +                       continue;
>> >                 if (!oom_kill_task(c))
>> >                         return 0;
>> >         }
>> >
>> > --
>>
>> I am worried about OOM latency in the worst case.
>> task_in_mem_cgroup() takes the child's task_lock, and task_lock is used
>> in many places, sometimes with other locks acquired while it is held.
>> For example, exit_fs() holds task_lock and then tries to take the write
>> lock on fs->lock. If the child already holds task_lock and is waiting
>> for the write lock on fs->lock, OOM latency becomes dependent on
>> fs->lock.
>>
>> I am not sure how many other use cases depend on other locks in the
>> same way, and even if none do today, we can't guarantee that for the
>> future.
>>
>> So how about a try_task_in_mem_cgroup()?
>> If we can't take task_lock, just continue to the next child.
>>
> It's recommended not to use trylock in unclear cases.
>
> Then, I think a possible replacement is to not use any lock in
> task_in_mem_cgroup at all. On short consideration, I don't think
> task_lock is necessary if we add some tricks and a memory barrier.
>
> Please let this patch go in as it is, because it is an obvious bug fix,
> and give me time.

I think it's not only a latency problem for OOM but also a potential
deadlock: we can't know the child's lock state in oom_kill_process().
So if you can remove the lock as in your suggestion, I am okay with
that.

> Now, I think of the following.
> This makes use of the fact that mm->owner is changed only at _exit()
> of the owner. If there is a race with _exit() and mm->owner is racy,
> the oom selection itself was already racy and bad.

It seems to make sense to me.

> ==
> int task_in_mem_cgroup_oom(struct task_struct *tsk, struct mem_cgroup *mem)
> {
>         struct mm_struct *mm;
>         int ret = 0;
>
>         mm = tsk->mm;
>         if (!mm)
>                 return ret;
>         /*
>          * We are not interested in tasks other than the owner.
>          * mm->owner is updated when the owner task exits. If the owner
>          * is exiting now (and races with us), we may miss it.
>          */
>         if (rcu_dereference(mm->owner) != tsk)
>                 return ret;

Yes. In this case, the OOM killer can wait a few seconds until this
task has exited. If we don't do that, we could kill another innocent
task.

>         rcu_read_lock();
>         /* while this task is alive, this task is the owner */
>         if (mem == mem_cgroup_from_task(tsk))
>                 ret = 1;
>         rcu_read_unlock();
>         return ret;
> }
> ==
> Hmm, it seems no memory barrier is necessary.
>
> Does anyone have another idea?
>
> Thanks,
> -Kame
>

--
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxxx. For more info on Linux MM,
see: http://www.linux-mm.org/ .