On Tue, Aug 13, 2019 at 07:48:09AM -0700, Paul E. McKenney wrote: > On Tue, Aug 13, 2019 at 02:30:19PM +0200, Frederic Weisbecker wrote: > > On Mon, Aug 12, 2019 at 04:23:16PM -0700, Paul E. McKenney wrote: > > > On Mon, Aug 12, 2019 at 11:02:33PM +0200, Frederic Weisbecker wrote: > > > > On Fri, Aug 02, 2019 at 08:15:01AM -0700, Paul E. McKenney wrote: > > > > Looks like it's not the right fix but, should you ever need to set an > > > > all-CPUs (system wide) tick dependency in the future, you can use tick_set_dep(). > > > > > > Indeed, I have dropped this patch, but I now do something similar in > > > RCU's CPU-hotplug notifiers. Which does have an effect, especially on > > > the system that isn't subject to the insane-latency cpu_relax(). > > > > > > Plus I am having to put a similar workaround into RCU's quiescent-state > > > forcing logic. > > > > > > But how should this really be done? > > > > > > Isn't there some sort of monitoring of nohz_full CPUs for accounting > > > purposes? If so, would it make sense for this monitoring to check for > > > long-duration kernel execution and enable the tick in this case? The > > > RCU dyntick machinery can be used to remotely detect the long-duration > > > kernel execution using something like the following: > > > > > > int nohz_in_kernel_snap = rcu_dynticks_snap_cpu(cpu); > > > > > > ... > > > > > > if (rcu_dynticks_in_eqs_cpu(cpu, nohz_in_kernel_snap) > > > nohz_in_kernel_snap = rcu_dynticks_snap_cpu(cpu); > > > else > > > /* Turn on the tick! */ > > > > > > I would supply rcu_dynticks_snap_cpu() and rcu_dynticks_in_eqs_cpu(), > > > which would be simple wrappers around RCU's private rcu_dynticks_snap() > > > and rcu_dynticks_in_eqs() functions. > > > > > > Would this make sense as a general solution, or am I missing a corner > > > case or three? > > > > Oh I see. Until now we considered than running into the kernel (between user/guest/idle) > > is supposed to be short but there can be specific places where it doesn't apply. > > > > I'm wondering if, more than just providing wrappers, this shouldn't be entirely > > driven by RCU using the tick_set_dep_cpu()/tick_clear_dep_cpu() at appropriate timings. > > > > I don't want to sound like I'm trying to put all the work on you :p It's just that > > the tick shouldn't know much about RCU, it's rather RCU that is a client for the tick and > > is probably better suited to determine when a CPU becomes annoying with its extended grace > > period. > > > > Arming a CPU timer could also be an alternative to tick_set_dep_cpu() for that. > > > > What do you think? > > Left to itself, RCU would take action only when a given nohz_full > in-kernel CPU was delaying a grace period, which is what the (lightly > tested) patch below is supposed to help with. If that is all that is > needed, well and good! > > But should we need long-running in-kernel nohz_full CPUs to turn on > their ticks when they are not blocking an RCU grace period, for example, > when RCU is idle, more will be needed. To that point, isn't there some > sort of monitoring that checks up on nohz_full CPUs ever second or so? Wouldn't such monitoring need to be more often than a second, given that rcu_urgent_qs and rcu_need_heavy_qs are configured typically to be sooner (200-300 jiffies on my system). > If so, perhaps that monitoring could periodically invoke an RCU function > that I provide for deciding when to turn the tick on. We would also need > to work out how to turn the tick off in a timely fashion once the CPU got > out of kernel mode, perhaps in rcu_user_enter() or rcu_nmi_exit_common(). > > If this would be called only every second or so, the separate grace-period > checking is still needed for its shorter timespan, though. > > Thoughts? Do you want me to test the below patch to see if it fixes the issue with my other test case (where I had a nohz full CPU holding up a grace period). 2 comments below: > ------------------------------------------------------------------------ > > commit 1cb89508804f6f2fdb79a1be032b1932d52318c4 > Author: Paul E. McKenney <paulmck@xxxxxxxxxxxxx> > Date: Mon Aug 12 16:14:00 2019 -0700 > > rcu: Force tick on for nohz_full CPUs not reaching quiescent states > > CPUs running for long time periods in the kernel in nohz_full mode > might leave the scheduling-clock interrupt disabled for then full > duration of their in-kernel execution. This can (among other things) > delay grace periods. This commit therefore forces the tick back on > for any nohz_full CPU that is failing to pass through a quiescent state > upon return from interrupt, which the resched_cpu() will induce. > > Reported-by: Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> > Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxx> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index 8c494a692728..8b8f5bffdc5a 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -651,6 +651,12 @@ static __always_inline void rcu_nmi_exit_common(bool irq) > */ > if (rdp->dynticks_nmi_nesting != 1) { > trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2, rdp->dynticks); > + if (tick_nohz_full_cpu(rdp->cpu) && > + rdp->dynticks_nmi_nesting == 2 && > + rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) { > + rdp->rcu_forced_tick = true; > + tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU); > + } > WRITE_ONCE(rdp->dynticks_nmi_nesting, /* No store tearing. */ > rdp->dynticks_nmi_nesting - 2); > return; > @@ -886,6 +892,16 @@ void rcu_irq_enter_irqson(void) > local_irq_restore(flags); > } > > +/* > + * If the scheduler-clock interrupt was enabled on a nohz_full CPU > + * in order to get to a quiescent state, disable it. > + */ > +void rcu_disable_tick_upon_qs(struct rcu_data *rdp) > +{ > + if (tick_nohz_full_cpu(rdp->cpu) && rdp->rcu_forced_tick) > + tick_dep_clear_cpu(rdp->cpu, TICK_DEP_MASK_RCU); > +} > + > /** > * rcu_is_watching - see if RCU thinks that the current CPU is not idle > * > @@ -1980,6 +1996,7 @@ rcu_report_qs_rdp(int cpu, struct rcu_data *rdp) > if (!offloaded) > needwake = rcu_accelerate_cbs(rnp, rdp); > > + rcu_disable_tick_upon_qs(rdp); > rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags); > /* ^^^ Released rnp->lock */ > if (needwake) > @@ -2269,6 +2286,7 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp)) > int cpu; > unsigned long flags; > unsigned long mask; > + struct rcu_data *rdp; > struct rcu_node *rnp; > > rcu_for_each_leaf_node(rnp) { > @@ -2293,8 +2311,11 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp)) > for_each_leaf_node_possible_cpu(rnp, cpu) { > unsigned long bit = leaf_node_cpu_bit(rnp, cpu); > if ((rnp->qsmask & bit) != 0) { > - if (f(per_cpu_ptr(&rcu_data, cpu))) > + rdp = per_cpu_ptr(&rcu_data, cpu); > + if (f(rdp)) { > mask |= bit; > + rcu_disable_tick_upon_qs(rdp); > + } I am guessing this was the earlier thing you corrected, cool!! > } > } > if (mask != 0) { > @@ -2322,7 +2343,7 @@ void rcu_force_quiescent_state(void) > rnp = __this_cpu_read(rcu_data.mynode); > for (; rnp != NULL; rnp = rnp->parent) { > ret = (READ_ONCE(rcu_state.gp_flags) & RCU_GP_FLAG_FQS) || > - !raw_spin_trylock(&rnp->fqslock); > + !raw_spin_trylock(&rnp->fqslock); > if (rnp_old != NULL) > raw_spin_unlock(&rnp_old->fqslock); > if (ret) > @@ -2855,7 +2876,7 @@ static void rcu_barrier_callback(struct rcu_head *rhp) > { > if (atomic_dec_and_test(&rcu_state.barrier_cpu_count)) { > rcu_barrier_trace(TPS("LastCB"), -1, > - rcu_state.barrier_sequence); > + rcu_state.barrier_sequence); > complete(&rcu_state.barrier_completion); > } else { > rcu_barrier_trace(TPS("CB"), -1, rcu_state.barrier_sequence); > @@ -2879,7 +2900,7 @@ static void rcu_barrier_func(void *unused) > } else { > debug_rcu_head_unqueue(&rdp->barrier_head); > rcu_barrier_trace(TPS("IRQNQ"), -1, > - rcu_state.barrier_sequence); > + rcu_state.barrier_sequence); > } > rcu_nocb_unlock(rdp); > } > @@ -2906,7 +2927,7 @@ void rcu_barrier(void) > /* Did someone else do our work for us? */ > if (rcu_seq_done(&rcu_state.barrier_sequence, s)) { > rcu_barrier_trace(TPS("EarlyExit"), -1, > - rcu_state.barrier_sequence); > + rcu_state.barrier_sequence); > smp_mb(); /* caller's subsequent code after above check. */ > mutex_unlock(&rcu_state.barrier_mutex); > return; > @@ -2938,11 +2959,11 @@ void rcu_barrier(void) > continue; > if (rcu_segcblist_n_cbs(&rdp->cblist)) { > rcu_barrier_trace(TPS("OnlineQ"), cpu, > - rcu_state.barrier_sequence); > + rcu_state.barrier_sequence); > smp_call_function_single(cpu, rcu_barrier_func, NULL, 1); > } else { > rcu_barrier_trace(TPS("OnlineNQ"), cpu, > - rcu_state.barrier_sequence); > + rcu_state.barrier_sequence); > } > } > put_online_cpus(); > @@ -3168,6 +3189,7 @@ void rcu_cpu_starting(unsigned int cpu) > rdp->rcu_onl_gp_seq = READ_ONCE(rcu_state.gp_seq); > rdp->rcu_onl_gp_flags = READ_ONCE(rcu_state.gp_flags); > if (rnp->qsmask & mask) { /* RCU waiting on incoming CPU? */ > + rcu_disable_tick_upon_qs(rdp); > /* Report QS -after- changing ->qsmaskinitnext! */ > rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags); Just curious about the existing code. If a CPU is just starting up (after bringing it online), how can RCU be waiting on it? I thought RCU would not be watching offline CPUs. thanks, - Joel [snip]