On Tue, Nov 21, 2023 at 02:13:53PM +0800, Z qiang wrote:
> >
> > On Mon, Nov 20, 2023 at 09:17:57PM -0800, Paul E. McKenney wrote:
> > > On Mon, Nov 20, 2023 at 07:26:05PM -0800, Ankur Arora wrote:
> > > >
> > > > Paul E. McKenney <paulmck@xxxxxxxxxx> writes:
> > > > > On Tue, Nov 07, 2023 at 01:57:34PM -0800, Ankur Arora wrote:
> > > > >> cond_resched() is used to provide urgent quiescent states for
> > > > >> read-side critical sections on PREEMPT_RCU=n configurations.
> > > > >> This was necessary because lacking preempt_count, there was no
> > > > >> way for the tick handler to know if we were executing in RCU
> > > > >> read-side critical section or not.
> > > > >>
> > > > >> An always-on CONFIG_PREEMPT_COUNT, however, allows the tick to
> > > > >> reliably report quiescent states.
> > > > >>
> > > > >> Accordingly, evaluate preempt_count() based quiescence in
> > > > >> rcu_flavor_sched_clock_irq().
> > > > >>
> > > > >> Suggested-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > > > >> Signed-off-by: Ankur Arora <ankur.a.arora@xxxxxxxxxx>
> > > > >> ---
> > > > >>  kernel/rcu/tree_plugin.h |  3 ++-
> > > > >>  kernel/sched/core.c      | 15 +--------------
> > > > >>  2 files changed, 3 insertions(+), 15 deletions(-)
> > > > >>
> > > > >> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > >> index f87191e008ff..618f055f8028 100644
> > > > >> --- a/kernel/rcu/tree_plugin.h
> > > > >> +++ b/kernel/rcu/tree_plugin.h
> > > > >> @@ -963,7 +963,8 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
> > > > >>   */
> > > > >>  static void rcu_flavor_sched_clock_irq(int user)
> > > > >>  {
> > > > >> -	if (user || rcu_is_cpu_rrupt_from_idle()) {
> > > > >> +	if (user || rcu_is_cpu_rrupt_from_idle() ||
> > > > >> +	    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
> > > > >
> > > > > This looks good.
> > > > >
> > > > >>  		/*
> > > > >>  		 * Get here if this CPU took its interrupt from user
> > > > >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > >> index bf5df2b866df..15db5fb7acc7 100644
> > > > >> --- a/kernel/sched/core.c
> > > > >> +++ b/kernel/sched/core.c
> > > > >> @@ -8588,20 +8588,7 @@ int __sched _cond_resched(void)
> > > > >>  		preempt_schedule_common();
> > > > >>  		return 1;
> > > > >>  	}
> > > > >> -	/*
> > > > >> -	 * In preemptible kernels, ->rcu_read_lock_nesting tells the tick
> > > > >> -	 * whether the current CPU is in an RCU read-side critical section,
> > > > >> -	 * so the tick can report quiescent states even for CPUs looping
> > > > >> -	 * in kernel context.  In contrast, in non-preemptible kernels,
> > > > >> -	 * RCU readers leave no in-memory hints, which means that CPU-bound
> > > > >> -	 * processes executing in kernel context might never report an
> > > > >> -	 * RCU quiescent state.  Therefore, the following code causes
> > > > >> -	 * cond_resched() to report a quiescent state, but only when RCU
> > > > >> -	 * is in urgent need of one.
> > > > >> -	 */
> > > > >> -#ifndef CONFIG_PREEMPT_RCU
> > > > >> -	rcu_all_qs();
> > > > >> -#endif
> > > > >
> > > > > But...
> > > > >
> > > > > Suppose we have a long-running loop in the kernel that regularly
> > > > > enables preemption, but only momentarily.  Then the added
> > > > > rcu_flavor_sched_clock_irq() check would almost always fail, making
> > > > > for extremely long grace periods.
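For illustration, the problematic pattern is a kernel loop of roughly
this shape (hypothetical code, not from the series; do_a_chunk_of_work()
is a made-up placeholder):

	/* Hypothetical CPU-bound kernel loop, for illustration only. */
	while (!done) {
		preempt_disable();
		do_a_chunk_of_work();	/* long-running work, made-up helper */
		preempt_enable();	/* preemptible only for an instant */
	}

Unless the scheduling-clock interrupt happens to land in that brief
window between the preempt_enable() and the next preempt_disable(), the
added check in rcu_flavor_sched_clock_irq() never sees a quiescent
preempt_count(), so the tick cannot report a quiescent state on this
CPU's behalf.
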
> > > >
> > > > So, my thinking was that if RCU wants to end a grace period, it would
> > > > force a context switch by setting TIF_NEED_RESCHED (and as patch 38
> > > > mentions, RCU always uses the eager version), causing __schedule() to
> > > > call rcu_note_context_switch().
> > > > That's similar to the preempt_schedule_common() case in the
> > > > _cond_resched() above.
> > >
> > > But that requires IPIing that CPU, correct?
> > >
> > > > But I see your point: RCU might just want to register a quiescent
> > > > state, and for this long-running loop rcu_flavor_sched_clock_irq()
> > > > does seem to fall down.
> > > >
> > > > > Or did I miss a change that causes preempt_enable() to help RCU out?
> > > >
> > > > Something like this?
> > > >
> > > > diff --git a/include/linux/preempt.h b/include/linux/preempt.h
> > > > index dc5125b9c36b..e50f358f1548 100644
> > > > --- a/include/linux/preempt.h
> > > > +++ b/include/linux/preempt.h
> > > > @@ -222,6 +222,8 @@ do { \
> > > >  	barrier(); \
> > > >  	if (unlikely(preempt_count_dec_and_test())) \
> > > >  		__preempt_schedule(); \
> > > > +	if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) \
> > > > +		rcu_all_qs(); \
> > > >  } while (0)
> > >
> > > Or maybe something like this to lighten the load a bit:
> > >
> > > #define preempt_enable() \
> > > do { \
> > > 	barrier(); \
> > > 	if (unlikely(preempt_count_dec_and_test())) { \
> > > 		__preempt_schedule(); \
> > > 		if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> > > 		    !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) \
> > > 			rcu_all_qs(); \
> > > 	} \
> > > } while (0)
> > >
> > > And at that point, we should be able to drop the PREEMPT_MASK, not
> > > that it makes any difference that I am aware of:
> > >
> > > #define preempt_enable() \
> > > do { \
> > > 	barrier(); \
> > > 	if (unlikely(preempt_count_dec_and_test())) { \
> > > 		__preempt_schedule(); \
> > > 		if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> > > 		    !(preempt_count() & SOFTIRQ_MASK)) \
> > > 			rcu_all_qs(); \
> > > 	} \
> > > } while (0)
> > >
> > > Except that we can migrate as soon as that preempt_count_dec_and_test()
> > > returns.  And that rcu_all_qs() disables and re-enables preemption,
> > > which will result in undesired recursion.  Sigh.
> > >
> > > So maybe something like this:
> > >
> > > #define preempt_enable() \
> > > do { \
> > > 	if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
> > > 	    !(preempt_count() & SOFTIRQ_MASK)) \
> >
> > Sigh.  This needs to include (PREEMPT_MASK | SOFTIRQ_MASK),
> > but check for equality to something like (1UL << PREEMPT_SHIFT).
> >
>
> For PREEMPT_RCU=n and CONFIG_PREEMPT_COUNT=y kernels, for reporting a
> QS in preempt_enable(), we can refer to this:
>
> void rcu_read_unlock_strict(void)
> {
> 	struct rcu_data *rdp;
>
> 	if (irqs_disabled() || preempt_count() || !rcu_state.gp_kthread)
> 		return;
> 	rdp = this_cpu_ptr(&rcu_data);
> 	rdp->cpu_no_qs.b.norm = false;
> 	rcu_report_qs_rdp(rdp);
> 	udelay(rcu_unlock_delay);
> }
>
> The case where the RCU critical section is in an NMI handler may need
> to be considered.

You are quite right, though one advantage of leveraging preempt_enable()
is that it cannot really enable preemption in an NMI handler.  But yes,
that might need to be accounted for in the comparison with preempt_count().
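For concreteness, folding that equality check into the last
preempt_enable() variant above would presumably give something like the
following (an untested sketch using only the symbols already in play
here; the NMI and preemptible-RCU concerns are still not addressed):

#define preempt_enable() \
do { \
	if (raw_cpu_read(rcu_data.rcu_urgent_qs) && \
	    (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) == \
			(1UL << PREEMPT_SHIFT)) \
		rcu_all_qs(); \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) { \
		__preempt_schedule(); \
	} \
} while (0)

That is, report the quiescent state only when this is the outermost
preempt_disable() section and no softirq processing is in flight, and
check before the decrement so that the CPU cannot migrate out from
under the check.
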
The actual condition needs to also allow for the possibility that this
preempt_enable() happened in a kernel built with preemptible RCU.
And probably a few other things that I have not yet thought of.

For one thing, rcu_implicit_dynticks_qs() might need adjustment.
Though I am currently hoping that it will still be able to enlist the
help of other things, for example, preempt_enable() and local_bh_enable().

Yes, it is the easiest thing in the world to just whip out the
resched_cpu() hammer earlier in the grace period, and maybe that is
the eventual solution.  But I would like to try avoiding the extra IPIs
if that can be done reasonably.  ;-)

							Thanx, Paul

> Thanks
> Zqiang
> >
> > Clearly time to sleep.  :-/
> >
> > 						Thanx, Paul
> >
> > > 			rcu_all_qs(); \
> > > 	barrier(); \
> > > 	if (unlikely(preempt_count_dec_and_test())) { \
> > > 		__preempt_schedule(); \
> > > 	} \
> > > } while (0)
> > >
> > > Then rcu_all_qs() becomes something like this:
> > >
> > > void rcu_all_qs(void)
> > > {
> > > 	unsigned long flags;
> > >
> > > 	/* Load rcu_urgent_qs before other flags. */
> > > 	if (!smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs)))
> > > 		return;
> > > 	this_cpu_write(rcu_data.rcu_urgent_qs, false);
> > > 	if (unlikely(raw_cpu_read(rcu_data.rcu_need_heavy_qs))) {
> > > 		local_irq_save(flags);
> > > 		rcu_momentary_dyntick_idle();
> > > 		local_irq_restore(flags);
> > > 	}
> > > 	rcu_qs();
> > > }
> > > EXPORT_SYMBOL_GPL(rcu_all_qs);
> > >
> > > > Though I do wonder about the likelihood of hitting the case you describe
> > > > and maybe instead of adding the check on every preempt_enable()
> > > > it might be better to instead force a context switch in
> > > > rcu_flavor_sched_clock_irq() (as we do in the PREEMPT_RCU=y case).
> > >
> > > Maybe.  But rcu_all_qs() is way lighter weight than a context switch.
> > >
> > > 						Thanx, Paul
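
For reference, the "force a context switch" alternative in that last
exchange would presumably amount to something like the following on the
PREEMPT_RCU=n side of rcu_flavor_sched_clock_irq() (an illustrative,
untested sketch; the trigger condition is a guess and is not part of
the posted patches):

	} else if (raw_cpu_read(rcu_data.rcu_urgent_qs)) {
		/*
		 * Hypothetical: no quiescent state is observable from
		 * this interrupt, so ask for a context switch and let
		 * rcu_note_context_switch() report the quiescent state.
		 */
		set_tsk_need_resched(current);
		set_preempt_need_resched();
	}

Whether that is worthwhile is exactly the trade-off being weighed above:
a full context switch on the looping CPU versus a lightweight
rcu_all_qs() call from preempt_enable().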