Hi Paul On Sat, 22 Sep 2012, Paul Walmsley wrote: > On Sat, 22 Sep 2012, Paul E. McKenney wrote: > > > And here is a patch. I am still having trouble reproducing the problem, > > but figured that I should avoid serializing things. > > Thanks, testing this now on v3.6-rc6. Looks like you solved it! Tested v3.6-rc6 + your stall diagnostic patch: http://marc.info/?l=linux-arm-kernel&m=134827237215882&w=2 on OMAP4430ES2 Pandaboard using omap2plus_defconfig and CONFIG_RCU_CPU_STALL_INFO=y; got the stall warnings. Then added "rcu: Fix day-one dyntick-idle stall-warning bug" from: http://marc.info/?l=linux-arm-kernel&m=134835120600590&w=2 Booted that, and the stall warnings did not appear within 30 minutes. To confirm that the problem being solved matched your hypothesis, the debugging patch below[1] was added to the RCU idle entry/exit code. Without the bugfix patch, a boot log transcript was obtained indicating that the idle loop was entered with tick_nohz_enabled=1 during a grace period with no callbacks present: http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-pre-fix.txt The debugging events started to appear at 1.867370 seconds into the boot. ENTER was pressed about 464 seconds in; this triggered the rcu_sched stall traceback. With the bugfix patch, a boot log transcript was obtained that indicated that the condition under test never occurred after waiting about 20 minutes: http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-post-fix.txt Thanks for being so willing to root-cause the issue, Paul; it's appreciated, and it's been quite instructive as well. Will address some remaining loose ends in follow-up E-mails. - Paul [1] Debugging patch to printk() if the previous idle loop entry occurred with tick_nohz_enabled=1 during a grace period with no RCU callbacks present: --- kernel/rcutree.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/kernel/rcutree.c b/kernel/rcutree.c index f1eb7ad..f42941b 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -60,6 +60,9 @@ /* Data structures. */ +extern int tick_nohz_enabled; +static int no_cbs_idle_entry_count; + static struct lock_class_key rcu_node_class[RCU_NUM_LVLS]; #define RCU_STATE_INITIALIZER(sname, cr) { \ @@ -400,8 +403,12 @@ void rcu_idle_enter(void) unsigned long flags; long long oldval; struct rcu_dynticks *rdtp; + int cpu; + long totqlen = 0; + struct rcu_data *rdp; local_irq_save(flags); + rdp = &__get_cpu_var(rcu_sched_data); rdtp = &__get_cpu_var(rcu_dynticks); oldval = rdtp->dynticks_nesting; WARN_ON_ONCE((oldval & DYNTICK_TASK_NEST_MASK) == 0); @@ -410,6 +417,12 @@ void rcu_idle_enter(void) else rdtp->dynticks_nesting -= DYNTICK_TASK_NEST_VALUE; rcu_idle_enter_common(rdtp, oldval); + if (tick_nohz_enabled && rcu_gp_in_progress(rdp->rsp)) { + for_each_possible_cpu(cpu) + totqlen += per_cpu_ptr(rdp->rsp->rda, cpu)->qlen; + if (totqlen == 0) + no_cbs_idle_entry_count = 1; + } local_irq_restore(flags); } EXPORT_SYMBOL_GPL(rcu_idle_enter); @@ -503,6 +516,10 @@ void rcu_idle_exit(void) rdtp->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE; rcu_idle_exit_common(rdtp, oldval); local_irq_restore(flags); + if (no_cbs_idle_entry_count) { + no_cbs_idle_entry_count = 0; + pr_err("* Tickless idle was entered with zero RCU callbacks\n"); + } } EXPORT_SYMBOL_GPL(rcu_idle_exit); -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html