Hi Gilles,

Gilles Carry wrote:
> From: gilles.carry <gilles.carry@xxxxxxxx>
>
> Symptoms:
> System hang (endless loop in plist_check_list) or BUG because
> of faulty prev/next pointers in pushable_task node.
>
>
> When push_rt_task successes finding a task to push away, it
> performs a double lock on the runqueues (local and target) but
> before getting both locks, it releases the local rq lock letting
> other cpus grab the task in between. (eg. pull_rt_task, timers...)
> When push_rt_task calls deactivate_task (which calls
> dequeue_pushable_task) the task may have already been removed
> from the pushable_tasks list by another cpu.
> Removing the node again corrupts the list.
>

Hmm, I was looking at this same area of the code earlier this week. The
problem with your assessment is that find_lock_lowest_rq() already
accounts for the dropped-lock migration and will return NULL if the task
was moved in the interim. I suppose there could be some weird
circumstance where the task is moved away and then moved back, but even
so, plist_del() is supposed to be idempotent, so I don't see why an extra
dequeue_pushable_task() by itself would be a problem.

At this point I don't really *love* your patch, because it seems to just
plaster over the underlying problem, which is that the list is getting
corrupted in the first place.

I do appreciate that you are looking at this problem, however! So thank
you for that, and please keep it up.

I am on vacation every Thursday and Friday for a while, so I will not be
responsive until Monday. I'll catch up with you guys then. Have a good
weekend.

-Greg
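
For reference, the two points above (re-checking the task after the lock
was dropped, and a delete that is safe to repeat) can be sketched in user
space roughly as follows. This is an illustration only, not the kernel
code: struct node, struct task, node_del_init(), task_still_pushable() and
demo() are hypothetical names, and the list is a plain circular doubly
linked list rather than a real plist. It assumes the delete re-initialises
the node so that it points at itself, in the style of list_del_init().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the kernel's list_head/plist types. */
struct node {
	struct node *prev, *next;
};

struct task {
	struct node pushable;	/* link in a run queue's pushable list */
	int cpu;		/* run queue (CPU) that currently owns the task */
};

static void node_init(struct node *n)
{
	n->prev = n;
	n->next = n;
}

/* A node (or list head) that points at itself is not linked to anything. */
static bool node_linked(const struct node *n)
{
	return n->next != n;
}

static void node_add(struct node *n, struct node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/*
 * Unlink and re-initialise. Because the node ends up pointing at itself,
 * calling this a second time only rewrites the node's own pointers.
 */
static void node_del_init(struct node *n)
{
	assert(n != NULL);
	n->prev->next = n->next;
	n->next->prev = n->prev;
	node_init(n);
}

/* Check to run after re-taking the lock: is the task still where we left it? */
static bool task_still_pushable(const struct task *t, int this_cpu)
{
	return t->cpu == this_cpu && node_linked(&t->pushable);
}

/* Single-threaded walk-through; returns 0, or the number of the failed step. */
static int demo(void)
{
	struct node head;
	struct task t = { .cpu = 0 };

	node_init(&head);
	node_init(&t.pushable);
	node_add(&t.pushable, &head);
	if (!task_still_pushable(&t, 0))
		return 1;

	/* Simulate another CPU pulling the task while our lock is dropped. */
	node_del_init(&t.pushable);
	t.cpu = 1;
	if (task_still_pushable(&t, 0))
		return 2;	/* revalidation must notice the move */

	/* An extra dequeue of the already-removed node must leave the list intact. */
	node_del_init(&t.pushable);
	if (node_linked(&head) || node_linked(&t.pushable))
		return 3;

	return 0;
}
```

With a delete that poisons or leaves stale prev/next pointers instead of
re-initialising them, the second node_del_init() call is exactly where a
list like this would get corrupted.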