On Tue, Sep 20, 2022 at 3:03 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
[...]
> > > > > >>>> Are you saying we should not make call_rcu() default-lazy for
> > > > > >>>> everyone? According to Thomas, anyone expecting call_rcu() to finish
> > > > > >>>> in any reasonable amount of time is a bug, and that usecase needs to
> > > > > >>>> be fixed to use synchronize_rcu(). Are we saying now that that's not
> > > > > >>>> our direction?
> > > > > >>>
> > > > > >>> I am saying that Thomas can sometimes get quite enthusiastic about
> > > > > >>> a particular direction. Plus I strongly suspect that Thomas has not
> > > > > >>> actually reviewed all of the RCU callbacks.
> > > > > >>>
> > > > > >>> Of course, I agree that it won't hurt to try out Thomas's suggestion,
> > > > > >>> which is why I did not object at the time. But we should be fully
> > > > > >>> prepared for it to blow up in our faces. ;-)
> > > > > >>
> > > > > >> Makes sense, thanks for clarifying. My impression was he does expect some
> > > > > >> pain in the short term, but the pain is worth not introducing a new
> > > > > >> opt-in API.
> > > > > >>
> > > > > >> I am wondering if putting it behind a CONFIG option will delay the pain too
> > > > > >> much. Thoughts? I am assuming that we want the CONFIG to turn it off by
> > > > > >> default.
> > > > > >
> > > > > > I tried a poor man's version of Thomas's idea by setting the
> > > > > > jiffies_till_first_fqs=1000 and rcu.rcu_expedited=1 boot parameters. Of
> > > > > > course, this is just for experimenting and has no regard for memory
> > > > > > pressure and such, so it is in the 'do not try at home' category.
> > > > > >
> > > > > > Then I tried jiffies_till_first_fqs=128 and compared the before/after
> > > > > > kernel logs.
> > > > > >
> > > > > > These are on ChromeOS hardware (5.19-rc4 kernel).
> > > > > > For jiffies_till_first_fqs=1000, everything is fine until this point:
> > > > > > [    8.383646] Freeing unused kernel image (initmem) memory: 1676K
> > > > > > [   11.378880] Write protecting the kernel read-only data: 28672k
> > > > > >
> > > > > > I do not see this 3-second delay with jiffies_till_first_fqs=128:
> > > > > > [    8.411866] Freeing unused kernel image (initmem) memory: 1676K
> > > > > > [    8.656701] Write protecting the kernel read-only data: 28672k
> > > > >
> > > > > At least part of this could be because of the rodata write-protection
> > > > > code calling rcu_barrier():
> > > > >
> > > > > From init/main.c:
> > > > >
> > > > > #ifdef CONFIG_STRICT_KERNEL_RWX
> > > > > static void mark_readonly(void)
> > > > > {
> > > > >         if (rodata_enabled) {
> > > > >                 /*
> > > > >                  * load_module() results in W+X mappings, which are cleaned
> > > > >                  * up with call_rcu(). Let's make sure that queued work is
> > > > >                  * flushed so that we don't hit false positives looking for
> > > > >                  * insecure pages which are W+X.
> > > > >                  */
> > > > >                 rcu_barrier();
> > > > >                 mark_rodata_ro();
> > > > >                 rodata_test();
> > > > >         } else
> > > > >                 pr_info("Kernel memory protection disabled.\n");
> > > > > }
> > > > >
> > > > > I think rcu_barrier() does not play well with jiffies_till_first_fqs
> > > > > (probably because of the missing wake-ups I found in the existing code;
> > > > > I should try that patch on this test).
> > > > >
> > > > > Is it expected that a sleep involving jiffies_till_first_fqs does not get
> > > > > disturbed by rcu_barrier()? That seems like a bug to me. At least for
> > > > > !NOCB, I don't see any call to rcu_gp_kthread_wake() from rcu_barrier()!
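
(To make that last point concrete, the kind of one-line change I have in
mind is below. This is an untested sketch only: rcu_gp_kthread_wake() is
the existing helper in kernel/rcu/tree.c, but the exact spot in the
rcu_barrier() path where the wakeup belongs is my assumption.)

        /*
         * Hypothetical: once rcu_barrier() has entrained its callbacks,
         * kick the grace-period kthread out of its (possibly long)
         * jiffies_till_first_fqs sleep, so that the barrier callbacks
         * are processed promptly on !NOCB configurations as well.
         */
        rcu_gp_kthread_wake();
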
> > > Another approach is to take your current stack of patches, but only those
> > > to RCU, and then as Thomas suggests, make call_rcu() behave lazily and
> > > make a call_rcu_expedited() that retains the current behavior. Keep the
> > > current grace-period timings.
> >
> > Yes, these sound good. But one issue is that we do not know all the cases
> > that need call_rcu_expedited(). There is the per-cpu refcount which needs
> > it, but I'm not aware of any others. If you take the patches, can we
> > convert callers to call_rcu_expedited() as regressions happen?
>
> Exactly.
>
> Several of the calls to wait_rcu_gp() would need their arguments to
> change from call_rcu() to call_rcu_expedited(). Oddly enough, including
> the call from synchronize_rcu_expedited() in the case where there are
> to be no expedited grace periods. ;-)

Oh true, I totally missed that. So if we are going this way (everything
async is lazy, and then we selectively make cases non-lazy), then we can't
do without a call_rcu_expedited() API in the patchset, right? So I'll go
work on that API then. It should be somewhat trivial to do, I think.

One question: call_rcu_expedited() just means "use regular non-lazy RCU",
not "use expedited grace periods", right? In that case, is it better to
call it call_rcu_fast() or call_rcu_nonlazy() to avoid conflicts with the
"expedited" terminology in the synchronize_{s,}rcu_expedited() APIs?

> > > Then make synchronize_rcu() and friends use call_rcu_expedited() and see
> > > if you can get away with only a small number of call_rcu_expedited()
> > > "upgrades".
> >
> > Right. It is a bit of a chicken-and-egg problem: if we whitelist
> > everything, then we have to blacklist as we go; if we blacklist
> > everything, then we may miss whitelisting something.
> >
> > Thoughts?
>
> Try it and see what happens. We can always fall back to call_rcu_lazy().

Ok :) I'll rewrite my patchset this way and push it out soon without
further ado, and I'll fix any regressions I notice by converting callers
to the _fast() API.

> > Meanwhile, just to clarify: I am very much invested in seeing the current
> > stack of patches merged soon, give or take the _expedited() conversions.
> > However, I did the above jiffies_till_first_fqs experiments in the hope
> > that they would quickly show the callers that need such conversion.
>
> I suggest doing the experiment on top of your patch stack, but excluding
> the call_rcu_lazy() conversions. That way, as we come to agreement on
> pieces, we can pull the patches in, so that the experiments do not get
> in the way.

Yes, sounds good!

thanks,

 - Joel
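
P.S. For concreteness, the rough shape I have in mind for the new API is
below. This is only a sketch: __call_rcu_common() and its "lazy" flag are
how I imagine the internal plumbing rather than existing interfaces, and
the _fast() name is still up for debate per the above.

        /*
         * Default path: queue the callback lazily; it may be batched
         * and only flushed later, e.g. on memory pressure or after a
         * timeout.
         */
        void call_rcu(struct rcu_head *head, rcu_callback_t func)
        {
                __call_rcu_common(head, func, /* lazy = */ true);
        }
        EXPORT_SYMBOL_GPL(call_rcu);

        /*
         * Opt-out for callers that depend on today's (non-lazy)
         * callback timing, e.g. the per-cpu refcount release path.
         */
        void call_rcu_fast(struct rcu_head *head, rcu_callback_t func)
        {
                __call_rcu_common(head, func, /* lazy = */ false);
        }
        EXPORT_SYMBOL_GPL(call_rcu_fast);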