Re: [PATCH] rcu: Delay the RCU-selftests during boot.

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Tue, 8 Mar 2022 10:10:40 -0800

On Tue, Mar 08, 2022 at 12:29:47PM +0100, Sebastian Andrzej Siewior wrote:
> On 2022-03-07 13:53:27 [-0800], Paul E. McKenney wrote:
> > > Correct, verified. But this means, that a task within a rcu_read_lock()
> > > section gets preempted for > 26 seconds before that timer fires. That
> > > delay during boot implies that something went wrong while it might
> > > happen at run-time under "normal" circumstances. So I wouldn't try to
> > > get this case covered.
> > 
> > The point being that there is no problem with the timer being queued,
> > just that it won't go off until later in the boot process.  Which in
> > turn means that all that is lost is the RCU CPU stall warning should one
> > be needed.  Is that correct, or am I missing a turn in there somewhere?
> 
> Correct. Before do_pre_smp_initcalls() there is no ksoftirqd so timers
> won't be processed. Therefore even the 26secs timer triggers, there
> will be no RCU-stall warning. Also I'm not sure how much of preemption
> can happen at this point since there is not much going on.
> Any delay at this points to a lockup and at this point the lockup
> detector isn't even fully operational (it does not trigger here during
> test).

Is sysrq operational at this point in RT?  If so, that can be used
to diagnose the hang, at least in development environments.  If the
need to use sysrq becomes a problem, checks/wakeups can be added to
rcu_sched_clock_irq().

> > > > > swait_event_timeout_exclusive() appears innocent.
> > > > 
> > > > I agree that it would rarely need to block, but if the task executing the
> > > > synchronize_rcu() preempted one of the readers, wouldn't it have to block?
> > > > Or am I missing some code path that excludes that possibility?
> > > 
> > > As explained above, it means ~20secs delay during bootup which I don't
> > > see happen. Once ksoftirqd is up, it is covered.
> > > Also: If _more_ users require a timer to expire so the system can
> > > continue to boot I am willing to investigate _why_ this is needed
> > > because it does delay boot up progress of the system.
> > 
> > Just to double-check -- in RT, ksoftirqd is up and running before the boot
> > sequence reaches the call to rcu_end_inkernel_boot() from kernel_init()?
> 
> Yes, this is the case.
> rcu_end_inkernel_boot() is really at the end of the boot process, we
> reached even SYSTEM_RUNNING ;)

Whew!  ;-)

> > This looks good to me, independent of the other discussion.  From what
> > I can see, this code path is infrequent, so using hrtimer all the time
> > should not be a problem.
> 
> Okay.
> 
> > Unless there are objections, I will pull this one in.
> 
> Good. Less patches for me to carry around.
> Thank you.

And here is what I ended up hand-applying, presumably due to differences
between -rt and -rcu.  I also expanded the changelog a bit.  Could you
please check to make sure that I didn't mess something up?

Oh, and I also couldn't resist taking advantage of 100 columns...
What can I say?

							Thanx, Paul

------------------------------------------------------------------------

commit 2895b3bb5f8a0ebe565c62b1d2e3e1efca669962
Author: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
Date:   Tue Mar 8 09:54:13 2022 -0800

    rcu-tasks: Use schedule_hrtimeout_range() to wait for grace periods
    
    The synchronous RCU-tasks grace-period-wait primitives invoke
    schedule_timeout_idle() to give readers a chance to exit their
    read-side critical sections.  Unfortunately, this fails during early
    boot on PREEMPT_RT because PREEMPT_RT relies solely on ksoftirqd to run
    timer handlers.  Because ksoftirqd cannot operate until its kthreads
    are spawned, there is a brief period of time following scheduler
    initialization where PREEMPT_RT cannot run the timer handlers that
    schedule_timeout_idle() relies on, resulting in a hang.
    
    To avoid this boot-time hang, this commit replaces schedule_timeout_idle()
    with schedule_hrtimeout(), so that the timer expires in hardirq context.
    This is ensures that the timer fires even on PREEMPT_RT throughout the
    irqs-enabled portions of boot as well as during runtime.
    
    The timer is set to expire between fract and fract + HZ / 2 jiffies in
    order to align with any other timers that might expire during that time,
    thus reducing the number of wakeups.
    
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
    Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 79eb280b1cd2a..bbad6b3376947 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -652,13 +652,16 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 	fract = rtp->init_fract;
 
 	while (!list_empty(&holdouts)) {
+		ktime_t exp;
 		bool firstreport;
 		bool needreport;
 		int rtst;
 
 		// Slowly back off waiting for holdouts
 		set_tasks_gp_state(rtp, RTGS_WAIT_SCAN_HOLDOUTS);
-		schedule_timeout_idle(fract);
+		exp = jiffies_to_nsecs(fract);
+		__set_current_state(TASK_IDLE);
+		schedule_hrtimeout_range(&exp, jiffies_to_nsecs(HZ / 2), HRTIMER_MODE_REL_HARD);
 
 		if (fract < HZ)
 			fract++;