>> >>>>>>>>>>> >>>>>>>>>>> Currently rcu_cpu_stall_reset() set rcu_state.jiffies_stall to one check >>>>>>>>>>> period later, i.e. jiffies + rcu_jiffies_till_stall_check(). But jiffies >>>>>>>>>>> is only updated in the timer interrupt, so when kgdb_cpu_enter() begins >>>>>>>>>>> to run there may already be nearly one rcu check period after jiffies. >>>>>>>>>>> Since all interrupts are disabled during kgdb_cpu_enter(), jiffies will >>>>>>>>>>> not be updated. When kgdb_cpu_enter() returns, rcu_state.jiffies_stall >>>>>>>>>>> maybe already gets timeout. >>>>>>>>>>> >>>>>>>>>>> We can set rcu_state.jiffies_stall to two rcu check periods later, e.g. >>>>>>>>>>> jiffies + (rcu_jiffies_till_stall_check() * 2) in rcu_cpu_stall_reset() >>>>>>>>>>> to avoid this problem. But this isn't a complete solution because kgdb >>>>>>>>>>> may take a very long time in irq disabled context. >>>>>>>>>>> >>>>>>>>>>> Instead, update jiffies at the beginning of rcu_cpu_stall_reset() can >>>>>>>>>>> solve all kinds of problems. >>>>>>>>>> >>>>>>>>>> Would it make sense for there to be a kgdb_cpu_exit()? In that case, >>>>>>>>>> the stalls could simply be suppressed at the beginning of the debug >>>>>>>>>> session and re-enabled upon exit, as is currently done for sysrq output >>>>>>>>>> via rcu_sysrq_start() and rcu_sysrq_end(). >>>>>>>>> Thank you for your advice, but that doesn't help. Because >>>>>>>>> rcu_sysrq_start() and rcu_sysrq_end() try to suppress the warnings >>>>>>>>> during sysrq, but kgdb already has no warnings during kgdb_cpu_enter() >>>>>>>>> since it is executed in irq disabled context. Instead, this patch >>>>>>>>> wants to suppress the warnings *after* kgdb_cpu_enter() due to a very >>>>>>>>> old jiffies value. >>>>>>>>> >>>>>>>> >>>>>>>> Hello, Huacai >>>>>>>> >>>>>>>> Is it possible to set the rcu_cpu_stall_suppress is true in >>>>>>>> dbg_touch_watchdogs() >>>>>>>> and reset the rcu_cpu_stall_suppress at the beginning and end of the >>>>>>>> RCU grace period? >>>>>>> This is possible but not the best: 1, kgdb is not the only caller of >>>>>>> rcu_cpu_stall_reset(); 2, it is difficult to find the "end" to reset >>>>>>> rcu_cpu_stall_suppress. >>>>>>> >>>>>> >>>>>> You can replace rcu_state.jiffies_stall update by setting rcu_cpu_stall_suppress >>>>>> in rcu_cpu_stall_reset(), and reset rcu_cpu_stall_suppress in rcu_gp_init() and >>>>>> rcu_gp_cleanup(). >>>>> What's the advantage compared with updating jiffies? Updating jiffies >>>>> seems more straight forward. >>>>> >>>> >>>> In do_update_jiffies_64(), need to acquire jiffies_lock raw spinlock, >>>> like you said, kgdb is not the only caller of rcu_cpu_stall_reset(), >>>> the rcu_cpu_stall_reset() maybe invoke in NMI (arch/x86/platform/uv/uv_nmi.c) >>> Reset rcu_cpu_stall_suppress in rcu_gp_init()/rcu_gp_cleanup() is >>> still not so good to me, because it does a useless operation in most >>> cases. Moreover, the rcu core is refactored again and again, something >>> may be changed in future. >>> >>> If do_update_jiffies_64() cannot be used in NMI context, can we >> >> What about updating jiffies in dbg_touch_watchdogs or adding a wrapper which updates >> both jiffies and jiffies_stall? > This can solve the kgdb problem, but I found that most callers of > rcu_cpu_stall_reset() are in irq disabled context so they may meet The duration of other contexts where interrupts are disabled may not be as long as in the case of kgdb? > similar problems. Modifying rcu_cpu_stall_reset() can solve all of > them. > > But due to the NMI issue, from my point of view, setting jiffies_stall > to jiffies + 300*HZ is the best solution now. :) If I understand correctly, the NMI issue is the deadlock issue? If so, plus the short duration of other irq disabled contexts, it’s ok just update jiffies in dbg_touch_watchdogs. Please correct me if anything wrong. :) > > Huacai >> >>> consider my old method [1]? >>> https://lore.kernel.org/rcu/CAAhV-H7j9Y=VvRLm8thLw-EX1PGqBA9YfT4G1AN7ucYS=iP+DQ@xxxxxxxxxxxxxx/T/#t >>> >>> Of course we should set rcu_state.jiffies_stall large enough, so we >>> can do like this: >>> >>> void rcu_cpu_stall_reset(void) >>> { >>> WRITE_ONCE(rcu_state.jiffies_stall, >>> - jiffies + rcu_jiffies_till_stall_check()); >>> + jiffies + 300 * HZ); >>> } >>> >>> 300s is the largest timeout value, and I think 300s is enough here in practice. >>> >>> Huacai >>> >>>> >>>> Thanks >>>> Zqiang >>>> >>>> >>>>> Huacai >>>>> >>>>>> >>>>>> Thanks >>>>>> Zqiang >>>>>> >>>>>>> >>>>>>>> or set rcupdate.rcu_cpu_stall_suppress_at_boot=1 in bootargs can >>>>>>>> suppress RCU stall >>>>>>>> in booting. >>>>>>> This is also possible, but it suppresses all kinds of stall warnings, >>>>>>> which is not what we want. >>>>>>> >>>>>>> Huacai >>>>>>>> >>>>>>>> >>>>>>>> Thanks >>>>>>>> Zqiang >>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> Huacai >>>>>>>>> >>>>>>>>>> >>>>>>>>>> Thanx, Paul >>>>>>>>>> >>>>>>>>>>> Cc: stable@xxxxxxxxxxxxxxx >>>>>>>>>>> Fixes: a80be428fbc1f1f3bc9ed924 ("rcu: Do not disable GP stall detection in rcu_cpu_stall_reset()") >>>>>>>>>>> Reported-off-by: Binbin Zhou <zhoubinbin@xxxxxxxxxxx> >>>>>>>>>>> Signed-off-by: Huacai Chen <chenhuacai@xxxxxxxxxxx> >>>>>>>>>>> --- >>>>>>>>>>> kernel/rcu/tree_stall.h | 1 + >>>>>>>>>>> 1 file changed, 1 insertion(+) >>>>>>>>>>> >>>>>>>>>>> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h >>>>>>>>>>> index b10b8349bb2a..1c7b540985bf 100644 >>>>>>>>>>> --- a/kernel/rcu/tree_stall.h >>>>>>>>>>> +++ b/kernel/rcu/tree_stall.h >>>>>>>>>>> @@ -153,6 +153,7 @@ static void panic_on_rcu_stall(void) >>>>>>>>>>> */ >>>>>>>>>>> void rcu_cpu_stall_reset(void) >>>>>>>>>>> { >>>>>>>>>>> + do_update_jiffies_64(ktime_get()); >>>>>>>>>>> WRITE_ONCE(rcu_state.jiffies_stall, >>>>>>>>>>> jiffies + rcu_jiffies_till_stall_check()); >>>>>>>>>>> } >>>>>>>>>>> -- >>>>>>>>>>> 2.39.3