Re: [PATCH RFC v2] rcu: Add a minimum time for marking boot as completed

Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> · Tue, 28 Feb 2023 20:09:02 +0000

Hi Frederic,

On Tue, Feb 28, 2023 at 12:04:36PM +0100, Frederic Weisbecker wrote:
> On Tue, Feb 28, 2023 at 01:30:25AM +0000, Joel Fernandes wrote:
> > On Tue, Feb 28, 2023 at 12:40:38AM +0100, Frederic Weisbecker wrote:
> > > On Mon, Feb 27, 2023 at 03:05:02PM -0800, Paul E. McKenney wrote:
> > > > On Mon, Feb 27, 2023 at 02:10:30PM -0500, Joel Fernandes wrote:
> > > > 
> > > > The combination of sysfs manipulated by userspace and a kernel failsafe
> > > > makes sense to me.  Especially if by default triggering the failsafe
> > > > splats.  That way, bugs where userspace fails to update the sysfs file
> > > > get caught.
> > > > 
> > > > The non-default silent-failsafe mode is also useful to allow some power
> > > > savings in advance of userspace getting the sysfs updating in place.
> > > > And of course the default splatting setup can be used in internal testing
> > > > with the release software being more tolerant of userspace foibles.
> > > 
> > > I'm wondering, this is all about CONFIG_RCU_LAZY, right? Or does also expedited
> > > GP turned off a bit early or late on boot matter for anybody in practice?
> > 
> > Yes, if you provide 'rcu_normal_after_boot', then after the boot ends, it
> > switches expedited GPs to normal ones.
> > 
> > It is the same issue for expedited, the kernel's version of what is 'boot' is
> > much shorter than what is actually boot.
> > 
> > This is also the case with suspend/resume's rcu_pm_notify(). See the comment:
> >   /*
> >    * On non-huge systems, use expedited RCU grace periods to make suspend
> >    * and hibernation run faster.
> >    */
> > 
> > There also we turn on/off both lazy and expedited. I don't see why we
> > shouldn't do it for boot.
> 
> Of course but I mean currently rcu_end_inkernel_boot() is called explicitly
> before the kernel calls init. From that point on, what is the source of the
> issue? Delaying lazy further would be enough or do we really need to delay
> forcing expedited as well? Or is it the reverse: delaying expedited further
> would matter and lazy doesn't play much role from there.

Both should play a role. For lazy, we found callbacks that showed up later in
the full boot sequence (like the SCSI issue).

For expedited, there is new data from Qiuxu showing 5% improvement in boot
time.

> It matters to know because if delaying expedited further is enough, then indeed
> we must delay the call to rcu_end_inkernel_boot() somehow. But if delaying
> expedited further doesn't matter and delaying lazy matter then it's possible
> that the issue is a callback that should be marked as call_rcu_hurry() and then
> the source of the problem is much broader.

Right, and we also don't know whether somebody will later queue a CB that
slows down boot (say, a lazy CB that does a wakeup), even if there are no
such CBs today. As noted, the SCSI issue did show up. Just to note, callbacks
doing wakeups are supposed to use call_rcu_hurry().

> I think the confusion comes from the fact that your changelog doesn't state precisely
> what the problem exactly is. Also do we need to wait for the kernel boot completion?
> And if so what is missing from kernel boot after the current explicit call to
> rcu_end_inkernel_boot()?

Yes, sorry, it was more of an RFC, but it still should have been clearer. For
the v3 I'll definitely make it clear.

rcu_end_inkernel_boot() is called before init is run. But the kernel cannot
possibly know when init has finished running and the system is, say, waiting
for user login. There's a considerable amount of time between
rcu_end_inkernel_boot() and the point where the system is actually "booted".
That's the main issue. We could look at CPU load, but that's not ideal. Maybe
wait for user input, but that sucks as well.

> Or do we also need to wait for userspace to complete the boot? Different
> problems, different solutions.
> 
> But in any case a countdown is not a way to go. Consider that rcu_lazy may
> be used by a larger audience than just chromium in the long run. You can not
> ask every admin to provide his own estimation per type of machine. You can't
> either rely on a long default value because that may have bad impact on
> workload assumptions launched right after boot.

Hmmm I see what you mean, so a conservative and configurable "fail-safe"
timeout followed by sysctl to end the boot earlier than the timeout, should
do it (something like 30 seconds IMHO sounds reasonable)? In any case,
whatever way we go, we would not end the kernel boot before
rcu_end_inkernel_boot() is called at least once (which is the current
behavior).

So it would be:

  low level boot + initcalls
       20 sec                         30 second timeout
|------------------------------|-------------------------|
                               |                         |
                old rcu_end_inkernel_boot()      new rcu_end_inkernel_boot()

But it could be, if the user decides:

  low level boot + initcalls
       20 sec                         10 second timeout
|------------------------------|-------------------------|
                               |                         |
                old rcu_end_inkernel_boot()      new rcu_end_inkernel_boot()
                                                 via /sys/ entry.

> > > So shouldn't we disable lazy callbacks by default when CONFIG_RCU_LAZY=y and then
> > > turn it on with "sysctl kernel.rcu.lazy=1" only whenever userspace feels ready
> > > about it? We can still keep the current call to rcu_end_inkernel_boot().
> > 
> > Hmm IMHO that would add more knobs for not much reason honestly. We already
> > have CONFIG_RCU_LAZY default disabled, I really don't want to add more
> > dependencies (like a user enables the config and does not see laziness).
> 
> I don't know. Like I said, different problems, different solutions. Let's
> identify what the issue is precisely. For example can we expect that the issues
> on boot can be a problem also on some temporary workloads?
> 
> Besides I'm currently testing a very hacky flavour of rcu_lazy and so far it
> shows many idle calls that would have been delayed if callbacks weren't queued
> as lazy.

Can you provide more details? What kind of hack flavor, and what is it doing?

thanks,

 - Joel

> I have yet to do actual energy and performance measurements but if it
> happens to show improvements, I suspect distros will want a supported yet
> default disabled Kconfig that can be turned on on boot or later. Of course we
> are not there yet but things to keep in mind...
> 
> Thanks.