On Thu, May 09, 2024 at 05:16:57AM -0300, Leonardo Bras wrote:
> On Wed, May 08, 2024 at 08:32:40PM -0700, Paul E. McKenney wrote:
> > On Wed, May 08, 2024 at 07:01:29AM -0700, Sean Christopherson wrote:
> > > On Wed, May 08, 2024, Leonardo Bras wrote:
> > > > Something just hit me, and maybe I need to propose something more generic.
> > >
> > > Yes.  This is what I was trying to get across with my complaints about keying off
> > > of the last VM-Exit time.  It's effectively a broad stroke "this task will likely
> > > be quiescent soon" and so the core concept/functionality belongs in common code,
> > > not KVM.
> >
> > OK, we could do something like the following wholly within RCU, namely
> > to make rcu_pending() refrain from invoking rcu_core() until the grace
> > period is at least the specified age, defaulting to zero (and to the
> > current behavior).
> >
> > Perhaps something like the patch shown below.
>
> That's exactly what I was thinking :)
>
> > Thoughts?
>
> Some suggestions below:
>
> > 							Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > commit abc7cd2facdebf85aa075c567321589862f88542
> > Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > Date:   Wed May 8 20:11:58 2024 -0700
> >
> >     rcu: Add rcutree.nocb_patience_delay to reduce nohz_full OS jitter
> >
> >     If a CPU is running either a userspace application or a guest OS in
> >     nohz_full mode, it is possible for a system call to occur just as an
> >     RCU grace period is starting.  If that CPU also has the scheduling-clock
> >     tick enabled for any reason (such as a second runnable task), and if the
> >     system was booted with rcutree.use_softirq=0, then RCU can add insult to
> >     injury by awakening that CPU's rcuc kthread, resulting in yet another
> >     task and yet more OS jitter due to switching to that task, running it,
> >     and switching back.
> >
> >     In addition, in the common case where that system call is not of
> >     excessively long duration, awakening the rcuc task is pointless.
> >     This pointlessness is due to the fact that the CPU will enter an extended
> >     quiescent state upon returning to the userspace application or guest OS.
> >     In this case, the rcuc kthread cannot do anything that the main RCU
> >     grace-period kthread cannot do on its behalf, at least if it is given
> >     a few additional milliseconds (for example, given the time duration
> >     specified by rcutree.jiffies_till_first_fqs, give or take scheduling
> >     delays).
> >
> >     This commit therefore adds a rcutree.nocb_patience_delay kernel boot
> >     parameter that specifies the grace period age (in milliseconds)
> >     before which RCU will refrain from awakening the rcuc kthread.
> >     Preliminary experimentation suggests a value of 1000, that is,
> >     one second.  Increasing rcutree.nocb_patience_delay will increase
> >     grace-period latency and in turn increase memory footprint, so systems
> >     with constrained memory might choose a smaller value.  Systems with
> >     less-aggressive OS-jitter requirements might choose the default value
> >     of zero, which keeps the traditional immediate-wakeup behavior, thus
> >     avoiding increases in grace-period latency.
> >
> >     Link: https://lore.kernel.org/all/20240328171949.743211-1-leobras@xxxxxxxxxx/
> >
> >     Reported-by: Leonardo Bras <leobras@xxxxxxxxxx>
> >     Suggested-by: Leonardo Bras <leobras@xxxxxxxxxx>
> >     Suggested-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> >     Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 0a3b0fd1910e6..42383986e692b 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -4981,6 +4981,13 @@
> >  			the ->nocb_bypass queue.  The definition of "too
> >  			many" is supplied by this kernel boot parameter.
> >
> > +	rcutree.nocb_patience_delay= [KNL]
> > +			On callback-offloaded (rcu_nocbs) CPUs, avoid
> > +			disturbing RCU unless the grace period has
> > +			reached the specified age in milliseconds.
> > +			Defaults to zero.  Large values will be capped
> > +			at five seconds.
> > +
> >  	rcutree.qhimark= [KNL]
> >  			Set threshold of queued RCU callbacks beyond which
> >  			batch limiting is disabled.
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 7560e204198bb..6e4b8b43855a0 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -176,6 +176,8 @@ static int gp_init_delay;
> >  module_param(gp_init_delay, int, 0444);
> >  static int gp_cleanup_delay;
> >  module_param(gp_cleanup_delay, int, 0444);
> > +static int nocb_patience_delay;
> > +module_param(nocb_patience_delay, int, 0444);
> >
> >  // Add delay to rcu_read_unlock() for strict grace periods.
> >  static int rcu_unlock_delay;
> > @@ -4334,6 +4336,8 @@ EXPORT_SYMBOL_GPL(cond_synchronize_rcu_full);
> >  static int rcu_pending(int user)
> >  {
> >  	bool gp_in_progress;
> > +	unsigned long j = jiffies;
>
> I think this is probably taken care of by the compiler, but just in case
> I would move the
> 	j = jiffies;
> closer to its use, in order to avoid reading 'jiffies' if rcu_pending
> exits before the nohz_full testing.
>
> > +	unsigned int patience = msecs_to_jiffies(nocb_patience_delay);
>
> What do you think of processing the new parameter at boot, and saving it
> in terms of jiffies already?
>
> It would make it unnecessary to convert ms -> jiffies every time we run
> rcu_pending.
>
> (Out-of-order execution will probably hide the extra division, but
> precomputing may still have less impact on some arches.)
>
> >  	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> >  	struct rcu_node *rnp = rdp->mynode;
> >
> > @@ -4347,11 +4351,13 @@ static int rcu_pending(int user)
> >  		return 1;
> >
> >  	/* Is this a nohz_full CPU in userspace or idle? (Ignore RCU if so.) */
> > -	if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> > +	gp_in_progress = rcu_gp_in_progress();
> > +	if ((user || rcu_is_cpu_rrupt_from_idle() ||
> > +	     (gp_in_progress && time_before(j + patience, rcu_state.gp_start))) &&
>
> I think you meant:
> 	time_before(j, rcu_state.gp_start + patience)
>
> or else this always fails, as 'now' can never happen before a previously
> started grace period, right?
>
> Also, as per rcu_nohz_full_cpu() we probably need to read it with
> READ_ONCE():
> 	time_before(j, READ_ONCE(rcu_state.gp_start) + patience)
>
> > +	    rcu_nohz_full_cpu())
> >  		return 0;
> >
> >  	/* Is the RCU core waiting for a quiescent state from this CPU? */
> > -	gp_in_progress = rcu_gp_in_progress();
> >  	if (rdp->core_needs_qs && !rdp->cpu_no_qs.b.norm && gp_in_progress)
> >  		return 1;
> >
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 340bbefe5f652..174333d0e9507 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -93,6 +93,15 @@ static void __init rcu_bootup_announce_oddness(void)
> >  		pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_init_delay);
> >  	if (gp_cleanup_delay)
> >  		pr_info("\tRCU debug GP cleanup slowdown %d jiffies.\n", gp_cleanup_delay);
> > +	if (nocb_patience_delay < 0) {
> > +		pr_info("\tRCU NOCB CPU patience negative (%d), resetting to zero.\n", nocb_patience_delay);
> > +		nocb_patience_delay = 0;
> > +	} else if (nocb_patience_delay > 5 * MSEC_PER_SEC) {
> > +		pr_info("\tRCU NOCB CPU patience too large (%d), resetting to %ld.\n", nocb_patience_delay, 5 * MSEC_PER_SEC);
> > +		nocb_patience_delay = 5 * MSEC_PER_SEC;
> > +	} else if (nocb_patience_delay) {
>
> Here you suggest that we don't print if 'nocb_patience_delay == 0',
> as it's the default behavior, right?
>
> I think printing on 0 could be useful to check if the feature exists, even
> though we are zeroing it, but this will probably add unnecessary verbosity.
> > +		pr_info("\tRCU NOCB CPU patience set to %d milliseconds.\n", nocb_patience_delay);
> > +	}
>
> Here I suppose something like this can take care of not needing to convert
> ms -> jiffies every rcu_pending():
>
> +	nocb_patience_delay = msecs_to_jiffies(nocb_patience_delay);
>

Uh, there is more to it, actually.  We need to make sure the user
understands that we are rounding the value down to a multiple of the
jiffy period, so it's not a surprise if the effective delay is not
exactly the same as the value passed on the kernel cmdline.

So something like the diff below should be OK, as this behavior is
explained in the docs, and pr_info() will print the effective value.

What do you think?

Thanks!
Leo

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0a3b0fd1910e..9a50be9fd9eb 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4974,20 +4974,28 @@
 			otherwise be caused by callback floods through
 			use of the ->nocb_bypass list.  However, in the
 			common non-flooded case, RCU queues directly to
 			the main ->cblist in order to avoid the extra
 			overhead of the ->nocb_bypass list and its lock.
 			But if there are too many callbacks queued during
 			a single jiffy, RCU pre-queues the callbacks into
 			the ->nocb_bypass queue.  The definition of "too
 			many" is supplied by this kernel boot parameter.
 
+	rcutree.nocb_patience_delay= [KNL]
+			On callback-offloaded (rcu_nocbs) CPUs, avoid
+			disturbing RCU unless the grace period has
+			reached the specified age in milliseconds.
+			Defaults to zero.  Large values will be capped
+			at five seconds.  Values are rounded down to a
+			multiple of the jiffy period.
+
 	rcutree.qhimark= [KNL]
 			Set threshold of queued RCU callbacks beyond which
 			batch limiting is disabled.
 
 	rcutree.qlowmark= [KNL]
 			Set threshold of queued RCU callbacks below which
 			batch limiting is re-enabled.
 
 	rcutree.qovld= [KNL]
 			Set threshold of queued RCU callbacks beyond which
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index fcf2b4aa3441..62ede401420f 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -512,20 +512,21 @@ do {									\
 		local_irq_save(flags);					\
 		if (rcu_segcblist_is_offloaded(&(rdp)->cblist))		\
 			raw_spin_lock(&(rdp)->nocb_lock);		\
 } while (0)
 #else /* #ifdef CONFIG_RCU_NOCB_CPU */
 #define rcu_nocb_lock_irqsave(rdp, flags) local_irq_save(flags)
 #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
 
 static void rcu_bind_gp_kthread(void);
 static bool rcu_nohz_full_cpu(void);
+static bool rcu_on_patience_delay(void);
 
 /* Forward declarations for tree_stall.h */
 static void record_gp_stall_check_time(void);
 static void rcu_iw_handler(struct irq_work *iwp);
 static void check_cpu_stall(struct rcu_data *rdp);
 static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp,
 				     const unsigned long gpssdelay);
 
 /* Forward declarations for tree_exp.h. */
 static void sync_rcu_do_polled_gp(struct work_struct *wp);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 340bbefe5f65..639243b0410f 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -5,20 +5,21 @@
  * or preemptible semantics.
  *
  * Copyright Red Hat, 2009
  * Copyright IBM Corporation, 2009
  *
  * Author: Ingo Molnar <mingo@xxxxxxx>
  *	   Paul E. McKenney <paulmck@xxxxxxxxxxxxx>
  */
 
 #include "../locking/rtmutex_common.h"
+#include <linux/jiffies.h>
 
 static bool rcu_rdp_is_offloaded(struct rcu_data *rdp)
 {
 	/*
 	 * In order to read the offloaded state of an rdp in a safe
 	 * and stable way and prevent from its value to be changed
 	 * under us, we must either hold the barrier mutex, the cpu
 	 * hotplug lock (read or write) or the nocb lock.  Local
 	 * non-preemptible reads are also safe.  NOCB kthreads and
 	 * timers have their own means of synchronization against the
@@ -86,20 +87,33 @@ static void __init rcu_bootup_announce_oddness(void)
 	if (rcu_kick_kthreads)
 		pr_info("\tKick kthreads if too-long grace period.\n");
 	if (IS_ENABLED(CONFIG_DEBUG_OBJECTS_RCU_HEAD))
 		pr_info("\tRCU callback double-/use-after-free debug is enabled.\n");
 	if (gp_preinit_delay)
 		pr_info("\tRCU debug GP pre-init slowdown %d jiffies.\n", gp_preinit_delay);
 	if (gp_init_delay)
 		pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_init_delay);
 	if (gp_cleanup_delay)
 		pr_info("\tRCU debug GP cleanup slowdown %d jiffies.\n", gp_cleanup_delay);
+	if (nocb_patience_delay < 0) {
+		pr_info("\tRCU NOCB CPU patience negative (%d), resetting to zero.\n",
+			nocb_patience_delay);
+		nocb_patience_delay = 0;
+	} else if (nocb_patience_delay > 5 * MSEC_PER_SEC) {
+		pr_info("\tRCU NOCB CPU patience too large (%d), resetting to %ld.\n",
+			nocb_patience_delay, 5 * MSEC_PER_SEC);
+		nocb_patience_delay = msecs_to_jiffies(5 * MSEC_PER_SEC);
+	} else if (nocb_patience_delay) {
+		nocb_patience_delay = msecs_to_jiffies(nocb_patience_delay);
+		pr_info("\tRCU NOCB CPU patience set to %d milliseconds.\n",
+			jiffies_to_msecs(nocb_patience_delay));
+	}
 	if (!use_softirq)
 		pr_info("\tRCU_SOFTIRQ processing moved to rcuc kthreads.\n");
 	if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG))
 		pr_info("\tRCU debug extended QS entry/exit.\n");
 	rcupdate_announce_bootup_oddness();
 }
 
 #ifdef CONFIG_PREEMPT_RCU
 
 static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
@@ -1260,10 +1274,29 @@ static bool rcu_nohz_full_cpu(void)
 
 /*
  * Bind the RCU grace-period kthreads to the housekeeping CPU.
  */
 static void rcu_bind_gp_kthread(void)
 {
 	if (!tick_nohz_full_enabled())
 		return;
 	housekeeping_affine(current, HK_TYPE_RCU);
 }
+
+/*
+ * Is this CPU a NO_HZ_FULL CPU that should ignore RCU if the time since
+ * the start of the current grace period is smaller than
+ * nocb_patience_delay?
+ *
+ * This code relies on the fact that all NO_HZ_FULL CPUs are also
+ * RCU_NOCB_CPU CPUs.
+ */
+static bool rcu_on_patience_delay(void)
+{
+#ifdef CONFIG_NO_HZ_FULL
+	if (!nocb_patience_delay)
+		return false;
+
+	if (time_before(jiffies, READ_ONCE(rcu_state.gp_start) + nocb_patience_delay))
+		return true;
+#endif /* #ifdef CONFIG_NO_HZ_FULL */
+	return false;
+}
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 7560e204198b..7a2d94370ab4 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -169,20 +169,22 @@ static int kthread_prio = IS_ENABLED(CONFIG_RCU_BOOST) ? 1 : 0;
 module_param(kthread_prio, int, 0444);
 
 /* Delay in jiffies for grace-period initialization delays, debug only. */
 static int gp_preinit_delay;
 module_param(gp_preinit_delay, int, 0444);
 static int gp_init_delay;
 module_param(gp_init_delay, int, 0444);
 static int gp_cleanup_delay;
 module_param(gp_cleanup_delay, int, 0444);
+static int nocb_patience_delay;
+module_param(nocb_patience_delay, int, 0444);
 
 // Add delay to rcu_read_unlock() for strict grace periods.
 static int rcu_unlock_delay;
 #ifdef CONFIG_RCU_STRICT_GRACE_PERIOD
 module_param(rcu_unlock_delay, int, 0444);
 #endif
 
 /*
  * This rcu parameter is runtime-read-only.  It reflects
  * a minimum allowed number of objects which can be cached
@@ -4340,25 +4342,27 @@ static int rcu_pending(int user)
 	lockdep_assert_irqs_disabled();
 
 	/* Check for CPU stalls, if enabled. */
 	check_cpu_stall(rdp);
 
 	/* Does this CPU need a deferred NOCB wakeup? */
 	if (rcu_nocb_need_deferred_wakeup(rdp, RCU_NOCB_WAKE))
 		return 1;
 
 	/* Is this a nohz_full CPU in userspace or idle? (Ignore RCU if so.) */
-	if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
+	gp_in_progress = rcu_gp_in_progress();
+	if ((user || rcu_is_cpu_rrupt_from_idle() ||
+	     (gp_in_progress && rcu_on_patience_delay())) &&
+	    rcu_nohz_full_cpu())
 		return 0;
 
 	/* Is the RCU core waiting for a quiescent state from this CPU? */
-	gp_in_progress = rcu_gp_in_progress();
 	if (rdp->core_needs_qs && !rdp->cpu_no_qs.b.norm && gp_in_progress)
 		return 1;
 
 	/* Does this CPU have callbacks ready to invoke? */
 	if (!rcu_rdp_is_offloaded(rdp) &&
 	    rcu_segcblist_ready_cbs(&rdp->cblist))
 		return 1;
 
 	/* Has RCU gone idle with this CPU needing another grace period? */
 	if (!gp_in_progress && rcu_segcblist_is_enabled(&rdp->cblist) &&