Patch "rcu-tasks: Pull sampling of ->percpu_dequeue_lim out of loop" has been added to the 6.6-stable tree

Sasha Levin <sashal@xxxxxxxxxx> · Sun, 3 Nov 2024 15:25:22 -0500

This is a note to let you know that I've just added the patch titled

    rcu-tasks: Pull sampling of ->percpu_dequeue_lim out of loop

to the 6.6-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     rcu-tasks-pull-sampling-of-percpu_dequeue_lim-out-of.patch
and it can be found in the queue-6.6 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.



commit d038a4e204e4af724fb9b20060d1b893a91c7f75
Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
Date:   Wed Aug 2 13:42:00 2023 -0700

    rcu-tasks: Pull sampling of ->percpu_dequeue_lim out of loop
    
    [ Upstream commit e62d8ae4620865411d1b2347980aa28ccf891a3d ]
    
    The rcu_tasks_need_gpcb() function samples ->percpu_dequeue_lim as part
    of the condition clause of a "for" loop, which is a bit confusing.  This
    commit therefore hoists this sampling out of the loop, so that the value
    is loaded once and the condition clause uses the result.
    
    So why does this work in the face of a concurrent switch from single-CPU
    queueing to per-CPU queueing?
    
    o       The call_rcu_tasks_generic() that makes the change has already
            enqueued its callback, which means that all of the other CPUs'
            callback queues are empty.
    
    o       For the call_rcu_tasks_generic() that first notices
            the switch to per-CPU queues, the smp_store_release()
            used to update ->percpu_enqueue_lim pairs with the
            raw_spin_trylock_rcu_node()'s full barrier that is
            between the READ_ONCE(rtp->percpu_enqueue_shift) and the
            rcu_segcblist_enqueue() that enqueues the callback.
    
    o       Because this CPU's queue is empty (unless it happens to
            be the original single queue, in which case there is no
            need for synchronization), this call_rcu_tasks_generic()
            will do an irq_work_queue() to schedule a handler for the
            needed rcuwait_wake_up() call.  This call will be ordered
            after the first call_rcu_tasks_generic() function's change to
            ->percpu_dequeue_lim.
    
    o       This rcuwait_wake_up() will either happen before or after the
            set_current_state() in rcuwait_wait_event().  If it happens
            before, the "condition" argument's call to rcu_tasks_need_gpcb()
            will be ordered after the original change, and all callbacks on
            all CPUs will be visible.  Otherwise, if it happens after, then
            the grace-period kthread's state will be set back to running,
            which will result in a later call to rcuwait_wait_event() and
            thus to rcu_tasks_need_gpcb(), which will again see the change.
    
    So it all works out.
    
    Suggested-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
    Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
    Signed-off-by: Frederic Weisbecker <frederic@xxxxxxxxxx>
    Stable-dep-of: fd70e9f1d85f ("rcu-tasks: Fix access non-existent percpu rtpcp variable in rcu_tasks_need_gpcb()")
    Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>
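
The difference between sampling the limit in the loop condition and
sampling it once before the loop can be illustrated with a small
user-space sketch.  This is not kernel code: it uses C11 atomics in
place of smp_load_acquire(), and the names dequeue_lim, visit(),
scan_resampled() and scan_hoisted() are hypothetical stand-ins chosen
for illustration.  The store inside visit() simulates, within a single
thread, a concurrent switch from one queue to four.

```c
#include <assert.h>
#include <stddef.h>
#include <stdatomic.h>

/* Hypothetical stand-in for rtp->percpu_dequeue_lim. */
static atomic_int dequeue_lim = 1;

/*
 * Hypothetical per-queue work.  While handling queue 0 it raises the
 * limit to 4, simulating a switch to per-CPU queueing that lands in
 * the middle of the scan.
 */
static void visit(int cpu, int *visited)
{
	assert(visited != NULL);
	(*visited)++;
	if (cpu == 0)
		atomic_store_explicit(&dequeue_lim, 4, memory_order_release);
}

/* Old shape: the limit is re-loaded every time the condition runs. */
static int scan_resampled(void)
{
	int visited = 0;
	int cpu;

	atomic_store(&dequeue_lim, 1);
	for (cpu = 0;
	     cpu < atomic_load_explicit(&dequeue_lim, memory_order_acquire);
	     cpu++)
		visit(cpu, &visited);
	return visited;
}

/* New shape: the limit is loaded once, before the loop starts. */
static int scan_hoisted(void)
{
	int visited = 0;
	int cpu;
	int limit;

	atomic_store(&dequeue_lim, 1);
	limit = atomic_load_explicit(&dequeue_lim, memory_order_acquire);
	for (cpu = 0; cpu < limit; cpu++)
		visit(cpu, &visited);
	return visited;
}
```

In this sketch scan_resampled() visits four queues because it observes
the raised limit mid-scan, whereas scan_hoisted() visits only the one
queue covered by the limit it sampled.  The commit message's argument
is that, in the kernel, missing the new limit in one pass is harmless
because the wakeup ordering guarantees a later rcu_tasks_need_gpcb()
call that sees the change.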

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index df81506cf2bde..90425d0ec09cf 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -432,6 +432,7 @@ static void rcu_barrier_tasks_generic(struct rcu_tasks *rtp)
 static int rcu_tasks_need_gpcb(struct rcu_tasks *rtp)
 {
 	int cpu;
+	int dequeue_limit;
 	unsigned long flags;
 	bool gpdone = poll_state_synchronize_rcu(rtp->percpu_dequeue_gpseq);
 	long n;
@@ -439,7 +440,8 @@ static int rcu_tasks_need_gpcb(struct rcu_tasks *rtp)
 	long ncbsnz = 0;
 	int needgpcb = 0;
 
-	for (cpu = 0; cpu < smp_load_acquire(&rtp->percpu_dequeue_lim); cpu++) {
+	dequeue_limit = smp_load_acquire(&rtp->percpu_dequeue_lim);
+	for (cpu = 0; cpu < dequeue_limit; cpu++) {
 		struct rcu_tasks_percpu *rtpcp = per_cpu_ptr(rtp->rtpcpu, cpu);
 
 		/* Advance and accelerate any new callbacks. */