On Tue, Mar 08, 2022 at 12:29:47PM +0100, Sebastian Andrzej Siewior wrote: > On 2022-03-07 13:53:27 [-0800], Paul E. McKenney wrote: > > > Correct, verified. But this means, that a task within a rcu_read_lock() > > > section gets preempted for > 26 seconds before that timer fires. That > > > delay during boot implies that something went wrong while it might > > > happen at run-time under "normal" circumstances. So I wouldn't try to > > > get this case covered. > > > > The point being that there is no problem with the timer being queued, > > just that it won't go off until later in the boot process. Which in > > turn means that all that is lost is the RCU CPU stall warning should one > > be needed. Is that correct, or am I missing a turn in there somewhere? > > Correct. Before do_pre_smp_initcalls() there is no ksoftirqd so timers > won't be processed. Therefore even the 26secs timer triggers, there > will be no RCU-stall warning. Also I'm not sure how much of preemption > can happen at this point since there is not much going on. > Any delay at this points to a lockup and at this point the lockup > detector isn't even fully operational (it does not trigger here during > test). Is sysrq operational at this point in RT? If so, that can be used to diagnose the hang, at least in development environments. If the need to use sysrq becomes a problem, checks/wakeups can be added to rcu_sched_clock_irq(). > > > > > swait_event_timeout_exclusive() appears innocent. > > > > > > > > I agree that it would rarely need to block, but if the task executing the > > > > synchronize_rcu() preempted one of the readers, wouldn't it have to block? > > > > Or am I missing some code path that excludes that possibility? > > > > > > As explained above, it means ~20secs delay during bootup which I don't > > > see happen. Once ksoftirqd is up, it is covered. > > > Also: If _more_ users require a timer to expire so the system can > > > continue to boot I am willing to investigate _why_ this is needed > > > because it does delay boot up progress of the system. > > > > Just to double-check -- in RT, ksoftirqd is up and running before the boot > > sequence reaches the call to rcu_end_inkernel_boot() from kernel_init()? > > Yes, this is the case. > rcu_end_inkernel_boot() is really at the end of the boot process, we > reached even SYSTEM_RUNNING ;) Whew! ;-) > > This looks good to me, independent of the other discussion. From what > > I can see, this code path is infrequent, so using hrtimer all the time > > should not be a problem. > > Okay. > > > Unless there are objections, I will pull this one in. > > Good. Less patches for me to carry around. > Thank you. And here is what I ended up hand-applying, presumably due to differences between -rt and -rcu. I also expanded the changelog a bit. Could you please check to make sure that I didn't mess something up? Oh, and I also couldn't resist taking advantage of 100 columns... What can I say? Thanx, Paul ------------------------------------------------------------------------ commit 2895b3bb5f8a0ebe565c62b1d2e3e1efca669962 Author: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx> Date: Tue Mar 8 09:54:13 2022 -0800 rcu-tasks: Use schedule_hrtimeout_range() to wait for grace periods The synchronous RCU-tasks grace-period-wait primitives invoke schedule_timeout_idle() to give readers a chance to exit their read-side critical sections. Unfortunately, this fails during early boot on PREEMPT_RT because PREEMPT_RT relies solely on ksoftirqd to run timer handlers. Because ksoftirqd cannot operate until its kthreads are spawned, there is a brief period of time following scheduler initialization where PREEMPT_RT cannot run the timer handlers that schedule_timeout_idle() relies on, resulting in a hang. To avoid this boot-time hang, this commit replaces schedule_timeout_idle() with schedule_hrtimeout(), so that the timer expires in hardirq context. This is ensures that the timer fires even on PREEMPT_RT throughout the irqs-enabled portions of boot as well as during runtime. The timer is set to expire between fract and fract + HZ / 2 jiffies in order to align with any other timers that might expire during that time, thus reducing the number of wakeups. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx> Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h index 79eb280b1cd2a..bbad6b3376947 100644 --- a/kernel/rcu/tasks.h +++ b/kernel/rcu/tasks.h @@ -652,13 +652,16 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp) fract = rtp->init_fract; while (!list_empty(&holdouts)) { + ktime_t exp; bool firstreport; bool needreport; int rtst; // Slowly back off waiting for holdouts set_tasks_gp_state(rtp, RTGS_WAIT_SCAN_HOLDOUTS); - schedule_timeout_idle(fract); + exp = jiffies_to_nsecs(fract); + __set_current_state(TASK_IDLE); + schedule_hrtimeout_range(&exp, jiffies_to_nsecs(HZ / 2), HRTIMER_MODE_REL_HARD); if (fract < HZ) fract++;