Patch "sched/rt: Fix live lock between select_fallback_rq() and RT push" has been added to the 6.1-stable tree

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



This is a note to let you know that I've just added the patch titled

    sched/rt: Fix live lock between select_fallback_rq() and RT push

to the 6.1-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     sched-rt-fix-live-lock-between-select_fallback_rq-and-rt-push.patch
and it can be found in the queue-6.1 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.


>From fc09027786c900368de98d03d40af058bcb01ad9 Mon Sep 17 00:00:00 2001
From: "Joel Fernandes (Google)" <joel@xxxxxxxxxxxxxxxxx>
Date: Sat, 23 Sep 2023 01:14:08 +0000
Subject: sched/rt: Fix live lock between select_fallback_rq() and RT push

From: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>

commit fc09027786c900368de98d03d40af058bcb01ad9 upstream.

During RCU-boost testing with the TREE03 rcutorture config, I found that
after a few hours, the machine locks up.

On tracing, I found that there is a live lock happening between 2 CPUs.
One CPU has an RT task running, while another CPU is being offlined
which also has an RT task running.  During this offlining, all threads
are migrated. The migration thread is repeatedly scheduled to migrate
actively running tasks on the CPU being offlined. This results in a live
lock because select_fallback_rq() keeps picking the CPU that an RT task
is already running on only to get pushed back to the CPU being offlined.

It is anyway pointless to pick CPUs for pushing tasks to if they are
being offlined only to get migrated away to somewhere else. This could
also add unwanted latency to this task.

Fix these issues by not selecting CPUs in RT if they are not 'active'
for scheduling, using the cpu_active_mask. Other parts in core.c already
use cpu_active_mask to prevent tasks from being put on CPUs going
offline.

With this fix I ran the tests for days and could not reproduce the
hang. Without the patch, I hit it in a few hours.

Signed-off-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
Signed-off-by: Ingo Molnar <mingo@xxxxxxxxxx>
Tested-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx
Link: https://lore.kernel.org/r/20230923011409.3522762-1-joel@xxxxxxxxxxxxxxxxx
Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
---
 kernel/sched/cpupri.c |    1 +
 1 file changed, 1 insertion(+)

--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -101,6 +101,7 @@ static inline int __cpupri_find(struct c
 
 	if (lowest_mask) {
 		cpumask_and(lowest_mask, &p->cpus_mask, vec->mask);
+		cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
 
 		/*
 		 * We have to ensure that we have at least one bit


Patches currently in stable-queue which might be from joel@xxxxxxxxxxxxxxxxx are

queue-6.1/sched-rt-fix-live-lock-between-select_fallback_rq-and-rt-push.patch



[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux