On Tue, Sep 20, 2022 at 3:03 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
[...]
> > > > > >>>> Are you saying we should not make call_rcu() default-lazy for
> > > > > >>>> everyone? According to Thomas, anyone expecting call_rcu() to finish
> > > > > >>>> in any reasonable amount of time is a bug, and that usecase needs to
> > > > > >>>> be fixed to use synchronize_rcu(). Are we saying now that that's not
> > > > > >>>> our direction?
> > > > > >>>
> > > > > >>> I am saying that Thomas can sometimes get quite enthusiastic about
> > > > > >>> a particular direction. Plus I strongly suspect that Thomas has not
> > > > > >>> actually reviewed all of the RCU callbacks.
> > > > > >>>
> > > > > >>> Of course, I agree that it won't hurt to try out Thomas's suggestion,
> > > > > >>> which is why I did not object at the time. But we should be fully
> > > > > >>> prepared for it to blow up in our faces. ;-)
> > > > > >>
> > > > > >> Makes sense, thanks for clarifying. My impression was he does expect some
> > > > > >> pain in the short term, but the pain is worth not introducing a new
> > > > > >> opt-in API.
> > > > > >>
> > > > > >> I am wondering if putting it behind a CONFIG option will delay the pain too
> > > > > >> much. Thoughts? I am assuming that we want the CONFIG to turn it off by
> > > > > >> default.
> > > > > >
> > > > > > I tried a poor man's version of Thomas's idea by setting the
> > > > > > jiffies_till_first_fqs=1000 and rcu.rcu_expedited=1 boot parameters. Of
> > > > > > course, this is just for experimenting and has no regard for memory
> > > > > > pressure and such, so it is in the 'do not try at home' category.
> > > > > >
> > > > > > Then I tried jiffies_till_first_fqs=128 and compared the before/after
> > > > > > kernel logs.
> > > > > >
> > > > > > These are on ChromeOS hardware (5.19-rc4 kernel).
> > > > > > For jiffies_till_first_fqs=1000, everything is fine until this point:
> > > > > > [    8.383646] Freeing unused kernel image (initmem) memory: 1676K
> > > > > > [   11.378880] Write protecting the kernel read-only data: 28672k
> > > > > >
> > > > > > I do not see this 3-second delay with jiffies_till_first_fqs=128:
> > > > > > [    8.411866] Freeing unused kernel image (initmem) memory: 1676K
> > > > > > [    8.656701] Write protecting the kernel read-only data: 28672k
> > > > >
> > > > > At least part of this could be because of the rodata write-protection
> > > > > code calling rcu_barrier():
> > > > >
> > > > > From init/main.c:
> > > > >
> > > > > #ifdef CONFIG_STRICT_KERNEL_RWX
> > > > > static void mark_readonly(void)
> > > > > {
> > > > >         if (rodata_enabled) {
> > > > >                 /*
> > > > >                  * load_module() results in W+X mappings, which are cleaned
> > > > >                  * up with call_rcu(). Let's make sure that queued work is
> > > > >                  * flushed so that we don't hit false positives looking for
> > > > >                  * insecure pages which are W+X.
> > > > >                  */
> > > > >                 rcu_barrier();
> > > > >                 mark_rodata_ro();
> > > > >                 rodata_test();
> > > > >         } else
> > > > >                 pr_info("Kernel memory protection disabled.\n");
> > > > > }
> > > > >
> > > > > I think rcu_barrier() does not play well with jiffies_till_first_fqs
> > > > > (probably because of the missing wake-ups I found in the existing code;
> > > > > I should try that patch on this test).
> > > > >
> > > > > Is it expected that a sleep involving jiffies_till_first_fqs does not get
> > > > > disturbed by rcu_barrier()? That seems like a bug to me. At least for
> > > > > !NOCB, I don't see any call to rcu_gp_kthread_wake() from rcu_barrier()!
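
(To make that last point concrete, the kind of one-line change I have in
mind is below. This is an untested sketch only: rcu_gp_kthread_wake() is
the existing helper in kernel/rcu/tree.c, but the exact spot in the
rcu_barrier() path where the wakeup belongs is my assumption.)

        /*
         * Hypothetical: once rcu_barrier() has entrained its callbacks,
         * kick the grace-period kthread out of its (possibly long)
         * jiffies_till_first_fqs sleep, so that the barrier callbacks
         * are processed promptly on !NOCB configurations as well.
         */
        rcu_gp_kthread_wake();
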
> > > Another approach is to take your current stack of patches, but only those
> > > to RCU, and then as Thomas suggests, make call_rcu() behave lazily and
> > > make a call_rcu_expedited() that retains the current behavior. Keep the
> > > current grace-period timings.
> >
> > Yes, these sound good. But one issue is that we do not know all the cases
> > that need call_rcu_expedited(). There is the per-cpu refcount which needs
> > it, but I'm not aware of any others. If you take the patches, can we
> > convert callers to call_rcu_expedited() as regressions happen?
>
> Exactly.
>
> Several of the calls to wait_rcu_gp() would need their arguments to
> change from call_rcu() to call_rcu_expedited(). Oddly enough, including
> the call from synchronize_rcu_expedited() in the case where there are
> to be no expedited grace periods. ;-)

Oh true, I totally missed that. So if we are going this way (everything
async is lazy, and then we selectively make cases non-lazy), then we can't
do without a call_rcu_expedited() API in the patchset, right? So I'll go
work on that API then. It should be somewhat trivial to do, I think.

One question: call_rcu_expedited() just means "use regular non-lazy RCU",
not "use expedited grace periods", right? In that case, is it better to
call it call_rcu_fast() or call_rcu_nonlazy() to avoid conflicts with the
"expedited" terminology in the synchronize_{s,}rcu_expedited() APIs?

> > > Then make synchronize_rcu() and friends use call_rcu_expedited() and see
> > > if you can get away with only a small number of call_rcu_expedited()
> > > "upgrades".
> >
> > Right. It is a bit of a chicken-and-egg problem: if we whitelist
> > everything, then we have to blacklist as we go; if we blacklist
> > everything, then we may miss whitelisting something.
> >
> > Thoughts?
>
> Try it and see what happens. We can always fall back to call_rcu_lazy().

Ok :) I'll rewrite my patchset this way and push it out soon without
further ado, and I'll fix any regressions I notice by converting callers
to the _fast() API.

> > Meanwhile, just to clarify: I am very much invested in seeing the current
> > stack of patches merged soon, give or take the _expedited() conversions.
> > However, I did the above jiffies_till_first_fqs experiments in the hope
> > that they would quickly show the callers that need such conversion.
>
> I suggest doing the experiment on top of your patch stack, but excluding
> the call_rcu_lazy() conversions. That way, as we come to agreement on
> pieces, we can pull the patches in, so that the experiments do not get
> in the way.

Yes, sounds good!

thanks,

 - Joel
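
P.S. For concreteness, the rough shape I have in mind for the new API is
below. This is only a sketch: __call_rcu_common() and its "lazy" flag are
how I imagine the internal plumbing rather than existing interfaces, and
the _fast() name is still up for debate per the above.

        /*
         * Default path: queue the callback lazily; it may be batched
         * and only flushed later, e.g. on memory pressure or after a
         * timeout.
         */
        void call_rcu(struct rcu_head *head, rcu_callback_t func)
        {
                __call_rcu_common(head, func, /* lazy = */ true);
        }
        EXPORT_SYMBOL_GPL(call_rcu);

        /*
         * Opt-out for callers that depend on today's (non-lazy)
         * callback timing, e.g. the per-cpu refcount release path.
         */
        void call_rcu_fast(struct rcu_head *head, rcu_callback_t func)
        {
                __call_rcu_common(head, func, /* lazy = */ false);
        }
        EXPORT_SYMBOL_GPL(call_rcu_fast);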