CPU hangs were reported while offlining/onlining CPUs on s390. Analysis
of the vmcore data shows that `stop_one_cpu_nowait()` in
`affine_move_task()` can fail when racing with CPU off-/on-lining. The
caller then deadlocks waiting for the pending migration stop work to
complete, which never happens.

Fix this by handling the failure gracefully.

Fixes: 9e81889c7648 ("sched: Fix affine_move_task() self-concurrency")
Cc: stable@xxxxxxxxxxxxxxx
Reported-by: Bill Peters <wpeters@xxxxxxxxx>
Tested-by: Bill Peters <wpeters@xxxxxxxxx>
Signed-off-by: Daniel Vacek <neelx@xxxxxxxxxx>
---
 kernel/sched/core.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f3951e4a55e5b..40a3c9ff74077 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2871,8 +2871,25 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 		preempt_disable();
 		task_rq_unlock(rq, p, rf);
 		if (!stop_pending) {
-			stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
-					    &pending->arg, &pending->stop_work);
+			stop_pending =
+				stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
+						    &pending->arg, &pending->stop_work);
+			/*
+			 * The state resulting in this failure is not expected
+			 * at this point. At least report a WARNING to be able
+			 * to panic and further debug if reproduced.
+			 */
+			if (WARN_ON(!stop_pending)) {
+				/*
+				 * Then try to handle the failure gracefully
+				 * to prevent the deadlock a few lines later.
+				 */
+				rq = task_rq_lock(p, rf);
+				pending->stop_pending = false;
+				p->migration_pending = NULL;
+				task_rq_unlock(rq, p, rf);
+				complete_all(&pending->done);
+			}
 		}
 		preempt_enable();
 
-- 
2.43.0
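
For readers who want to reason about the failure mode outside the kernel, the following is a minimal single-threaded userspace sketch of the logic described above. It is only a model under stated assumptions: every `model_*` type and function is a hypothetical stand-in invented for illustration, no locking or real stopper thread is involved, and the queueing stub simply completes the work immediately on success. It shows that when queueing the stop work fails and the failure is ignored, the completion the caller waits on is never signalled, whereas the graceful path clears the pending state and signals the completion itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real definitions. */
struct model_completion {
	bool done;
};

struct model_pending {
	bool stop_pending;
	struct model_completion done;
};

struct model_task {
	struct model_pending *migration_pending;
};

/*
 * Stand-in for stop_one_cpu_nowait(): returns false when the work could
 * not be queued (modelled here by stopper_enabled == false). On success
 * the model runs the "stop work" immediately, which clears the pending
 * state and signals the completion.
 */
bool model_queue_stop_work(bool stopper_enabled, struct model_task *p,
			   struct model_pending *pending)
{
	if (!stopper_enabled)
		return false;

	pending->stop_pending = false;
	p->migration_pending = NULL;
	pending->done.done = true;
	return true;
}

/*
 * Model of the patched code path. Returns true if a subsequent wait on
 * pending->done would return, false if it would block forever.
 * handle_failure selects between the old behaviour (return value
 * ignored) and the graceful handling added by the patch.
 */
bool model_affine_move(struct model_task *p, struct model_pending *pending,
		       bool stopper_enabled, bool handle_failure)
{
	bool queued;

	assert(p != NULL && pending != NULL);

	/* The caller publishes the pending request before queueing. */
	p->migration_pending = pending;
	pending->stop_pending = true;

	queued = model_queue_stop_work(stopper_enabled, p, pending);
	if (!queued && handle_failure) {
		/* Nobody else will finish this request; do it here. */
		pending->stop_pending = false;
		p->migration_pending = NULL;
		pending->done.done = true;
	}

	return pending->done.done;
}
```

Calling `model_affine_move()` with `stopper_enabled` set to false and `handle_failure` set to false returns false, which corresponds to the reported hang. With `handle_failure` set to true it returns true and leaves `migration_pending` cleared.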