Patch "rcu: Defer RCU kthreads wakeup when CPU is dying" has been added to the 6.6-stable tree

This is a note to let you know that I've just added the patch titled

    rcu: Defer RCU kthreads wakeup when CPU is dying

to the 6.6-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     rcu-defer-rcu-kthreads-wakeup-when-cpu-is-dying.patch
and it can be found in the queue-6.6 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.



commit 25442abcd36f14a6891e57e93be65b28cb2cb890
Author: Frederic Weisbecker <frederic@xxxxxxxxxx>
Date:   Tue Dec 19 00:19:15 2023 +0100

    rcu: Defer RCU kthreads wakeup when CPU is dying
    
    [ Upstream commit e787644caf7628ad3269c1fbd321c3255cf51710 ]
    
    When the CPU goes idle for the last time during the CPU-down hotplug
    process, RCU reports a final quiescent state for the current CPU. If
    this quiescent state propagates up to the top, some tasks may then be
    woken up to complete the grace period: the main grace period kthread
    and/or the expedited main workqueue (or kworker).
    
    If those kthreads have a SCHED_FIFO policy, the wake up can indirectly
    arm the RT bandwidth timer on the local offline CPU. Since this happens
    after hrtimers have been migrated at the CPUHP_AP_HRTIMERS_DYING stage,
    the timer gets ignored. Therefore if the RCU kthreads are waiting for
    RT bandwidth to be available, they may never actually be scheduled.
    
    This triggers TREE03 rcutorture hangs:
    
             rcu: INFO: rcu_preempt self-detected stall on CPU
             rcu:     4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved)
             rcu:     (t=21035 jiffies g=938281 q=40787 ncpus=6)
             rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
             rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
             rcu: RCU grace-period kthread stack dump:
             task:rcu_preempt     state:R  running task     stack:14896 pid:14    tgid:14    ppid:2      flags:0x00004000
             Call Trace:
              <TASK>
              __schedule+0x2eb/0xa80
              schedule+0x1f/0x90
              schedule_timeout+0x163/0x270
              ? __pfx_process_timeout+0x10/0x10
              rcu_gp_fqs_loop+0x37c/0x5b0
              ? __pfx_rcu_gp_kthread+0x10/0x10
              rcu_gp_kthread+0x17c/0x200
              kthread+0xde/0x110
              ? __pfx_kthread+0x10/0x10
              ret_from_fork+0x2b/0x40
              ? __pfx_kthread+0x10/0x10
              ret_from_fork_asm+0x1b/0x30
              </TASK>
    
    The situation can't be solved by simply unpinning the timer. The hrtimer
    infrastructure and the nohz heuristics involved in finding the best
    remote target for an unpinned timer would then also need to handle
    enqueues from an offline CPU in the most horrendous way.
    
    So fix this on the RCU side instead and defer the wake up to an online
    CPU if it's too late for the local one.
    
    Reported-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
    Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
    Signed-off-by: Frederic Weisbecker <frederic@xxxxxxxxxx>
    Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
    Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@xxxxxxxxx>
    Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 9af42eae1ba3..4fe47ed95eeb 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1013,6 +1013,38 @@ static bool rcu_future_gp_cleanup(struct rcu_node *rnp)
 	return needmore;
 }
 
+static void swake_up_one_online_ipi(void *arg)
+{
+	struct swait_queue_head *wqh = arg;
+
+	swake_up_one(wqh);
+}
+
+static void swake_up_one_online(struct swait_queue_head *wqh)
+{
+	int cpu = get_cpu();
+
+	/*
+	 * If called from rcutree_report_cpu_starting(), wake up
+	 * is dangerous that late in the CPU-down hotplug process. The
+	 * scheduler might queue an ignored hrtimer. Defer the wake up
+	 * to an online CPU instead.
+	 */
+	if (unlikely(cpu_is_offline(cpu))) {
+		int target;
+
+		target = cpumask_any_and(housekeeping_cpumask(HK_TYPE_RCU),
+					 cpu_online_mask);
+
+		smp_call_function_single(target, swake_up_one_online_ipi,
+					 wqh, 0);
+		put_cpu();
+	} else {
+		put_cpu();
+		swake_up_one(wqh);
+	}
+}
+
 /*
  * Awaken the grace-period kthread.  Don't do a self-awaken (unless in an
  * interrupt or softirq handler, in which case we just might immediately
@@ -1037,7 +1069,7 @@ static void rcu_gp_kthread_wake(void)
 		return;
 	WRITE_ONCE(rcu_state.gp_wake_time, jiffies);
 	WRITE_ONCE(rcu_state.gp_wake_seq, READ_ONCE(rcu_state.gp_seq));
-	swake_up_one(&rcu_state.gp_wq);
+	swake_up_one_online(&rcu_state.gp_wq);
 }
 
 /*
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 8239b39d945b..6e87dc764f47 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -173,7 +173,6 @@ static bool sync_rcu_exp_done_unlocked(struct rcu_node *rnp)
 	return ret;
 }
 
-
 /*
  * Report the exit from RCU read-side critical section for the last task
  * that queued itself during or before the current expedited preemptible-RCU
@@ -201,7 +200,7 @@ static void __rcu_report_exp_rnp(struct rcu_node *rnp,
 			raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 			if (wake) {
 				smp_mb(); /* EGP done before wake_up(). */
-				swake_up_one(&rcu_state.expedited_wq);
+				swake_up_one_online(&rcu_state.expedited_wq);
 			}
 			break;
 		}

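For context outside the diff: the helper added above encapsulates a reusable
pattern, namely checking whether the local CPU has already gone offline and, if
so, handing the wakeup to an online housekeeping CPU via a non-waiting IPI. The
sketch below restates that pattern with hypothetical names (my_defer_wake,
my_wake_ipi); it is illustrative only and not part of the patch.

	#include <linux/cpu.h>
	#include <linux/smp.h>
	#include <linux/swait.h>
	#include <linux/sched/isolation.h>

	/* Runs on the chosen online CPU and performs the actual wakeup. */
	static void my_wake_ipi(void *arg)
	{
		swake_up_one(arg);
	}

	/* Wake @wqh, deferring to an online CPU if the local one is offline. */
	static void my_defer_wake(struct swait_queue_head *wqh)
	{
		int cpu = get_cpu();	/* disables preemption */

		if (unlikely(cpu_is_offline(cpu))) {
			int target = cpumask_any_and(housekeeping_cpumask(HK_TYPE_RCU),
						     cpu_online_mask);

			/* wait == 0: do not block on the remote wakeup. */
			smp_call_function_single(target, my_wake_ipi, wqh, 0);
			put_cpu();
		} else {
			put_cpu();
			swake_up_one(wqh);
		}
	}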


