On Sat, Oct 01, 2016 at 08:00:49PM -0400, Rich Felker wrote: > > > > > > - During the whole sequence, hrtimer expiration times are being set to > > > > > > exact jiffies (@ 100 Hz), whereas before it they're quite arbitrary. > > > > > > > > > > When a CPU goes into NOHZ idle and the next (timer/hrtimer) is farther out > > > > > than the next tick, then tick_sched_timer is set to this next event which > > > > > can be far out. So that's expected. > > > > > > > > > > > - The CLOCK_MONOTONIC hrtimer times do not match up with the > > > > > > timestamps; they're off by about 0.087s. I assume this is just > > > > > > sched_clock vs clocksource time and not a big deal. > > > > > > > > > > Yes. You can tell the tracer to use clock monotonic so then they should match. > > > > > > > > > > > - The rcu_sched process is sleeping with timeout=1. This seems > > > > > > odd/excessive. > > > > > > > > > > Why is that odd? That's one tick, i.e. 10ms in your case. And that's not > > > > > the problem at all. The problem is your timer not firing, but the cpu is > > > > > obviously either getting out of idle and then moves the tick ahead for some > > > > > unknown reason. > > > > > > > > And a one-jiffy timeout is in fact expected behavior when HZ=100. > > > > You have to be running HZ=250 or better to have two-jiffy timeouts, > > > > and HZ=500 or better for three-jiffy timeouts. > > > > > > One possible theory I'm looking at is that the two cpus are both > > > waking up (leaving cpu_idle_poll or cpuidle_idle_call) every jiffy > > > with sufficient consistency that every time the rcu_gp_fqs_check_wake > > > loop wakes up in rcu_gp_kthread, the other cpu is in cpu_idle_loop but > > > outside the rcu_idle_enter/rcu_idle_exit range. Would this block > > > forward process? I added an LED indicator in rcu_gp_fqs_check_wake > > > that shows the low 2 bits of rnp->qsmask every time it's called, and > > > under normal operation the LEDs just flash on momentarily or just one > > > stays on for a few seconds then goes off. During a stall both are > > > stuck on. I'm still trying to make sense of the code but my impression > > > so far is that, on a 2-cpu machine, this is a leaf node and the 2 bits > > > correspond directly to cpus; is that right? If so I'm a bit confused > > > because I don't see how forward progress could ever happen if the cpu > > > on which rcu_gp_kthread is blocking forward progress of > > > rcu_gp_kthread. > > > > No. If the CPUs are entering and leaving idle, and if your timers > > were waking up rcu_sched every few jiffies like it asks, then the > > repeated idle entry/exit events would be noticed, courtesy of the atomic > > increments of ->dynticks and the rcu_sched kthread's snapshotting and > > checking of this value. > > I don't see how rcu_sched could notice the changes if it's stuck in > the wait loop I think it's stuck in. There is no check of ->dynticks > in rcu_gp_fqs_check_wake. Just in case checking *gfp & RCU_GP_FLAG_FQS > accomplishes this, I updated my LED hacks to clear the LEDs in that > exit path (and killed the other place that could turn them back on > from cpu_idle_loop) but I still get 2 LEDs on for 21s followed by a > stall message. > > > Even if the CPUs were always non-idle on every > > time force_quiescent_state() is invoked, give or take the possibility > > of counter wrap -- but even on a 32-bit system, that takes awhile. > > Perhaps force_quiescent_state is not getting invoked? Does that sound > plausible, and if so, how should I go about debugging it? I see it is > called from the stall reporting code, so that's presumably what's > breaking the stalls. I can confirm that force_quiescent_state is not being called at all except from the stall handler. Where is it supposed to be called from? I can't find any code paths to it except the stall handler and __call_rcu_core, but the latter seems to only be called when adding new rcu callbacks, not as a response to a stalled rcu_sched thread. Maybe I'm missing something but this seems like incorrect logic in the rcu subsystem. Rich -- To unsubscribe from this list: send the line "unsubscribe devicetree" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html