On 2022-03-04 21:00:25 [-0800], Paul E. McKenney wrote:
> > During SYSTEM_BOOTING we could do softirqs right away but we lack the
> > infrastructure. Starting with SYSTEM_SCHEDULING we rely on the thread
> > so it needs to be spawned earlier. The problem with SYSTEM_SCHEDULING+
> > is that we may deadlock if the softirqs are performed in IRQ-context.
>
> Understood.  My goal is to prevent RCU from being yet another odd
> constraint that people writing boot-time code need to worry about.
> Or at least no additional odd constraints beyond the ones that it
> already presents.  :-/

Do we have that many people doing boot-time code before early_initcall()?

> > > This might seem a bit utopian or even unreasonable, but please keep
> > > in mind that both the scheduler and the idle loop use RCU.
> >
> > But the problem is only the usage of synchronize_rcu().
>
> And synchronize_rcu_expedited(), but yes, in that call_rcu() and so
> on still work.

That should be the majority of the users.

> > So rcu_read_lock() and call_rcu() work.  Only synchronize_rcu() does
> > not.  Couldn't we make a rule to use it at the earliest within
> > early_initcall()?
>
> Of course we could make such a rule.
>
> And sometimes, people running into problems with that rule might be able
> to move their code earlier or later and avoid problems.  But other times
> they have to do something else.  Which will sometimes mean that we are
> asking them to re-implement some odd special case of RCU within their
> own subsystem, which just does not sound like a good idea.
>
> In fact, my experience indicates that it is way easier to make RCU work
> more globally than to work around all the issues stemming from these
> sorts of limits on RCU users.  Takes less time, also.
>
> And it probably is not all -that- hard.

We had one user _that_ early and he moved away. People might misunderstand
things or optimize for something that is not really needed.
If this is needed _before_ early_initcall() we could still move it right
after the scheduler is initialized. I would just prefer not to optimize
for things that might never be needed.
For instance flush_workqueue() is made "working" a few functions earlier
(before the RCU selftest). You could enqueue work items earlier, they
would just wait until workqueue_init().
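As a hypothetical sketch of what I mean (the work item, its function and
the caller are made up, not actual kernel code): once
workqueue_init_early() has set up the system workqueues, enqueueing is
legal; the item simply sits on the queue until workqueue_init() spawns
the worker pools.

#include <linux/printk.h>
#include <linux/workqueue.h>

/* Made-up example: work queued before workqueue_init() is not lost,
 * it just waits until the worker pools exist.
 */
static void early_example_fn(struct work_struct *work)
{
	pr_info("runs only after workqueue_init() has spawned workers\n");
}
static DECLARE_WORK(early_example_work, early_example_fn);

static void called_early_during_boot(void)
{
	/* Fine this early; execution is merely deferred. */
	schedule_work(&early_example_work);
}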
> > > However, that swait_event_timeout_exclusive() doesn't need exact
> > > timing during boot.  Anything that lets other tasks run for at least
> > > a few tens of microseconds (up to say a millisecond) could easily
> > > work just fine.  Is there any such thing in RT?
> >
> > swait_event_timeout_exclusive() appears not to be the culprit. It is
> > invoked a few times (with a 6.5ms timeout) but returns without setting
> > up a timer. So either my setup avoids the timer or this always happens
> > and is not related to my config.
>
> Now that you mention it, yes.  There is only one CPU, so unless you have
> an odd series of preemptions, it quickly figures out that it does not
> need to wait.  But that odd series of preemptions really is out there,
> patiently waiting for us to lose context on this code.

Correct, verified. But this means that a task within a rcu_read_lock()
section gets preempted for more than 26 seconds before that timer fires.
Such a delay during boot implies that something went wrong, whereas it
might happen at run-time under "normal" circumstances. So I wouldn't try
to get this case covered.

> > rcu_tasks_wait_gp() does schedule_timeout_idle() and this is the one
> > that blocks. This could be replaced with schedule_hrtimeout() (just
> > tested). I hate the idea of using a precise delay in a timeout-like
> > situation. But we could use schedule_hrtimeout_range() with a HZ delta
> > so it kind of feels like the timer_list timer ;)
>
> If schedule_hrtimeout_range() works, I am good with it.
> And you are right, precision is not required here.  And maybe
> schedule_hrtimeout_range() could also be used to create a crude
> boot-time-only polling loop for the swait_event_timeout_exclusive()?

I made something to cover the schedule_hrtimeout_range() part. I wouldn't
bother with swait_event_timeout_exclusive() due to the large timeout _and_
because we are still booting.
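If I understand the polling-loop idea correctly, it would look roughly
like this crude, hypothetical sketch (cond() and all names are made up,
and it would have to be boot-time only):

#include <linux/hrtimer.h>
#include <linux/jiffies.h>
#include <linux/ktime.h>
#include <linux/sched.h>

/* Made-up sketch: poll a condition with short hrtimer sleeps instead
 * of arming a timer_list timer via swait_event_timeout_exclusive().
 */
static bool boot_poll_wait(bool (*cond)(void), unsigned long timeout_j)
{
	unsigned long deadline = jiffies + timeout_j;

	while (!cond()) {
		ktime_t t = ms_to_ktime(1);

		if (time_after(jiffies, deadline))
			return false;	/* timed out */
		/* ~1ms sleep; hard-irq expiry does not need ksoftirqd. */
		set_current_state(TASK_IDLE);
		schedule_hrtimeout_range(&t, NSEC_PER_MSEC,
					 HRTIMER_MODE_REL_HARD);
	}
	return true;
}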
> > swait_event_timeout_exclusive() appears innocent.
>
> I agree that it would rarely need to block, but if the task executing the
> synchronize_rcu() preempted one of the readers, wouldn't it have to block?
> Or am I missing some code path that excludes that possibility?

As explained above, it means a ~20 second delay during boot-up, which I
don't see happening. Once ksoftirqd is up, it is covered.
Also: if _more_ users require a timer to expire so the system can continue
to boot, I am willing to investigate _why_ this is needed, because it does
delay the boot-up progress of the system.

> > > These would be conditioned on IS_ENABLED(CONFIG_PREEMPT_RT).
> > >
> > > But now you are going to tell me that wakeups cannot be done from the
> > > scheduler tick interrupt handler?  If that is the case, are there
> > > other approaches?
> >
> > If you mean my irqwork patch then I think we are down to:
> > - spawn ksoftirqd early
> > - use schedule_hrtimeout() during boot or the whole time (no idea how
> >   often this triggers).
>
> The boot-time schedule_hrtimeout_range() seems to cover things,
> especially given that most of the time there would be no need to block.
> Or is there yet another gap where schedule_hrtimeout_range() does not
> work?  (After the scheduler starts.)

The patch below covers it. This works once the system has a working timer,
which aligns with !RT. I've been testing this and understand that tracing
is using it. I didn't manage to trigger it after boot, so I assume the
user can't easily trigger that timer _very_ often.

-------->8-----

From: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
Date: Mon, 7 Mar 2022 17:08:23 +0100
Subject: [PATCH] rcu-tasks: Use schedule_hrtimeout_range() while waiting for the gp.

The RCU selftest uses schedule_timeout_idle(), which fails on PREEMPT_RT
because it runs early in the boot-up phase, at which point ksoftirqd is
not yet ready but is required for the timer to expire.

To avoid this lockup, use schedule_hrtimeout() and let the timer expire
in hardirq context. This ensures that the timer fires even on PREEMPT_RT
without any further requirement.

The timer is set to expire between fract and fract + HZ / 2 jiffies in
order to minimize the number of extra wake-ups and to align with other
possible timers which expire within this window.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
---
 kernel/rcu/tasks.h | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index f804afb304135..e99f9e61cc7a3 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -630,12 +630,15 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 	while (!list_empty(&holdouts)) {
 		bool firstreport;
 		bool needreport;
+		ktime_t exp;
 		int rtst;
 
 		/* Slowly back off waiting for holdouts */
 		set_tasks_gp_state(rtp, RTGS_WAIT_SCAN_HOLDOUTS);
-		schedule_timeout_idle(fract);
-
+		exp = jiffies_to_nsecs(fract);
+		__set_current_state(TASK_IDLE);
+		schedule_hrtimeout_range(&exp, jiffies_to_nsecs(HZ / 2),
+					 HRTIMER_MODE_REL_HARD);
 		if (fract < HZ)
 			fract++;
-- 
2.35.1
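To spell out the reasoning behind the hunk (my summary, not part of the
patch itself):

/* Before: schedule_timeout_idle() arms a timer_list timer.  On
 * PREEMPT_RT its expiry runs from the timer softirq, which is not
 * processed before ksoftirqd is spawned, so an early-boot sleeper
 * never wakes up:
 */
schedule_timeout_idle(fract);

/* After: a hard-irq hrtimer expires straight from the interrupt
 * handler and therefore works with or without ksoftirqd.  The slack
 * of HZ / 2 jiffies lets the wake-up coalesce with other timers due
 * in that window:
 */
exp = jiffies_to_nsecs(fract);
__set_current_state(TASK_IDLE);
schedule_hrtimeout_range(&exp, jiffies_to_nsecs(HZ / 2),
			 HRTIMER_MODE_REL_HARD);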