On Tue, 13 Dec 2011 04:01:21 -0800 Paul Turner <pjt@xxxxxxxxxx> wrote: > On 12/12/2011 10:57 PM, Daisuke Nishimura wrote: > > > There is a small race between do_fork() and sched_move_task(), which is trying > > to move the child. > > > > do_fork() sched_move_task() > > --------------------------------+--------------------------------- > > copy_process() > > sched_fork() > > task_fork_fair() > > -> vruntime of the child is initialized > > based on that of the parent. > > > Hmm, so here vruntime of child is actually initialized to vruntime - min_V > > > -> we can see the child in "tasks" file now. > > task_rq_lock() > > task_move_group_fair() > > > So since on a regular fork we just copy and don't actually go through > the attach muck I'm assuming this is an external actor who's seen the > child in the tasks file and is moving it? > Right. > > ->child.se.vruntime -= (old)cfs_rq->min_vruntime > > ->child.se.vruntime += (new)cfs_rq->min_vruntime > > > This would then add delta min_V between the two cfs_rqs > > > task_rq_unlock() > > wake_up_new_task() > > ... > > enqueue_entity() > > child.se->vruntime += cfs_rq->min_vruntime > > > > > As a result, vruntime of the child becomes far bigger than min_vruntime, > > if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime. > > > > This patch fixes this problem by just ignoring such process in task_move_group_fair(), > > because the vruntime has already been normalized in task_fork_fair(). > > > > Signed-off-by: Daisuke Nishimura <nishimura-YQH0OdQVrdy45+QrQBaojngSJqDPrsil@xxxxxxxxxxxxxxxx> > > ---This would need an explanatory > > kernel/sched_fair.c | 4 ++-- > > 1 files changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c > > index 5c9e679..df145a9 100644 > > --- a/kernel/sched_fair.c > > +++ b/kernel/sched_fair.c > > @@ -4922,10 +4922,10 @@ static void task_move_group_fair(struct task_struct *p, int on_rq) > > * to another cgroup's rq. This does somewhat interfere with the > > * fair sleeper stuff for the first placement, but who cares. > > */ > > - if (!on_rq) > > + if (!on_rq && p->state != TASK_RUNNING) > > p->se.vruntime -= cfs_rq_of(&p->se)->min_vruntime; > > > We also have the choice of keying off something like > !se.sum_exec_runtime here which is reset in sched_fork() which might > be less fragile/make the problem interaction more obvious to. Either sum_exec_runtime can be used for a process that has just been created, but IIUC it cannot be used for a process that is being woken up from sleeping(in [3/3] case). Or, do you mean using p->se.sum_exec_runtime for [1/3] and p->state for [3/3] ? If so, I'll do it in the next post. > way this is really a corner case deserving of an explanatory comment. > This is a little icky but I don't have any wildly better ideas. > I'll add comments. > > set_task_rq(p, task_cpu(p)); > > - if (!on_rq) > > + if (!on_rq && p->state != TASK_RUNNING) > > p->se.vruntime += cfs_rq_of(&p->se)->min_vruntime; > > } > > #endif > > > Acked-by: Paul Turner <pjt@xxxxxxxxxx> Thanks you for your reviews. Daisuke Nishimura. -- To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html