Hi Gilles, Gilles Carry wrote: > From: gilles.carry <gilles.carry@xxxxxxxx> > > Symptoms: > System hang (endless loop in plist_check_list) > or BUG because of faulty prev/next pointers in > pushable_task node. > > When push_rt_task successes finding a task to push away, it > performs a double lock on the runqueues (local and target) but > before getting both locks, it releases the local rq lock allowing > other cpus grab the task in between. (eg. pull_rt_task, timers...) > In some situations, when push_rt_task calls dequeue_pushable_task > the task may have already been removed from the pushable_tasks > list by another cpu. Removing the node again corrupts the list. > > This patch adds a sanity check to dequeue_pushable_task which only > removes the node if it's still on the original list. > I think your analysis is spot on, though I am still concerned about your implementation. It still "papers over" the issue, so to speak, which is that the item moved to a new RQ. In addition, it adds a relatively expensive check to the fast path of the list dequeue. Based on this, I have submitted patches that address your finding but in a more targeted and efficient approach (crediting you and Chirag) here: http://lkml.org/lkml/2008/10/3/130 So I have to respectfully NACK this patch, but I do thank you for investigating and pointing me at the actual root problem. It is sincerely appreciated. Regards, -Greg > Signed-off-by: Gilles Carry <gilles.carry@xxxxxxxx> > Cc: ghaskins@xxxxxxxxxx > --- > kernel/sched_rt.c | 18 +++++++++++++++++- > 1 files changed, 17 insertions(+), 1 deletions(-) > > diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c > index 57a0c0d..7aa4450 100644 > --- a/kernel/sched_rt.c > +++ b/kernel/sched_rt.c > @@ -62,7 +62,17 @@ static void enqueue_pushable_task(struct rq *rq, struct task_struct *p) > > static void dequeue_pushable_task(struct rq *rq, struct task_struct *p) > { > - plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks); > + struct plist_node *next; > + > + /* Sanity check: delete only if node is still on this list */ > + plist_for_each(next, &rq->rt.pushable_tasks) { > + if (&p->pushable_tasks == next) { > + plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks); > + return; > + } > + } > + > + > } > > #else > @@ -1105,6 +1115,12 @@ static int push_rt_task(struct rq *rq) > * try again, since the other cpus will pull from us > * when they are ready > */ > + > + /* > + * If we reach here and the task has migrated to another cpu > + * (paranoid == 0?) calling dequeue_pushable_task may cause > + * pushable_tasks list corruption. > + */ > dequeue_pushable_task(rq, next_task); > goto out; > } >
Attachment:
signature.asc
Description: OpenPGP digital signature