On Thu, May 03, 2012 at 05:55:57PM +0300, Gilad Ben-Yossef wrote: > Current timer code fails to correctly return a value meaning > that there is no future timer event, with the result that > the timer keeps getting re-armed in HZ one shot mode even > when we could turn it off, generating unneeded interrupts. > This patch attempts to fix this problem. > > What is happening is that when __next_timer_interrupt() wishes > to return a value that signifies "there is no future timer > event", it returns (base->timer_jiffies + NEXT_TIMER_MAX_DELTA). > > However, the code in tick_nohz_stop_sched_tick(), which called > __next_timer_interrupt() via get_next_timer_interrupt(), > compares the return value to (last_jiffies + NEXT_TIMER_MAX_DELTA) > to see if the timer needs to be re-armed. > > base->timer_jiffies != last_jiffies and so > tick_nohz_stop_sched_tick() interperts the return value as > indication that there is a distant future event 12 days > from now and programs the timer to fire next after KIME_MAX > nsecs instead of avoiding to arm it. This ends up causesing > a needless interrupt once every KTIME_MAX nsecs. Good catch! So if I understand correctly, base->timer_jiffies can be backward compared to last_jiffies. If we return base->timer_jiffies + NEXT_TIMER_MAX_DELTA, the next_jiffies - last_jiffies diff gives a delta that is a bit before NEXT_TIMER_MAX_DELTA. And this can indeed happen if we haven't got any timer list executed since we updated jiffies last, timer_jiffies can be a backward compared to last_jiffies. This is harmless but causes needless timers. I just have small comment below: > > I've noticed a similar but slightly different fix to the > same problem in the Tilera kernel tree from Chris M. (I've > wrote this before seeing that one), so some variation of this > fix is in use on real hardware for some time now. > > Signed-off-by: Gilad Ben-Yossef <gilad@xxxxxxxxxxxxx> > CC: Thomas Gleixner <tglx@xxxxxxxxxxxxx> > CC: Tejun Heo <tj@xxxxxxxxxx> > CC: John Stultz <johnstul@xxxxxxxxxx> > CC: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> > CC: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx> > CC: Mel Gorman <mel@xxxxxxxxx> > CC: Mike Frysinger <vapier@xxxxxxxxxx> > CC: David Rientjes <rientjes@xxxxxxxxxx> > CC: Hugh Dickins <hughd@xxxxxxxxxx> > CC: Minchan Kim <minchan.kim@xxxxxxxxx> > CC: Konstantin Khlebnikov <khlebnikov@xxxxxxxxxx> > CC: Christoph Lameter <cl@xxxxxxxxx> > CC: Chris Metcalf <cmetcalf@xxxxxxxxxx> > CC: Hakan Akkan <hakanakkan@xxxxxxxxx> > CC: Max Krasnyansky <maxk@xxxxxxxxxxxx> > CC: Frederic Weisbecker <fweisbec@xxxxxxxxx> > CC: linux-kernel@xxxxxxxxxxxxxxx > CC: linux-mm@xxxxxxxxx > --- > kernel/timer.c | 31 +++++++++++++++++++++---------- > 1 files changed, 21 insertions(+), 10 deletions(-) > > diff --git a/kernel/timer.c b/kernel/timer.c > index a297ffc..32ba64a 100644 > --- a/kernel/timer.c > +++ b/kernel/timer.c > @@ -1187,11 +1187,13 @@ static inline void __run_timers(struct tvec_base *base) > * is used on S/390 to stop all activity when a CPU is idle. > * This function needs to be called with interrupts disabled. > */ > -static unsigned long __next_timer_interrupt(struct tvec_base *base) > +static bool __next_timer_interrupt(struct tvec_base *base, > + unsigned long *next_timer) > { > unsigned long timer_jiffies = base->timer_jiffies; > unsigned long expires = timer_jiffies + NEXT_TIMER_MAX_DELTA; > - int index, slot, array, found = 0; > + int index, slot, array; > + bool found = false; > struct timer_list *nte; > struct tvec *varray[4]; > > @@ -1202,12 +1204,12 @@ static unsigned long __next_timer_interrupt(struct tvec_base *base) > if (tbase_get_deferrable(nte->base)) > continue; > > - found = 1; > + found = true; > expires = nte->expires; > /* Look at the cascade bucket(s)? */ > if (!index || slot < index) > goto cascade; > - return expires; > + goto out; > } > slot = (slot + 1) & TVR_MASK; > } while (slot != index); > @@ -1233,7 +1235,7 @@ cascade: > if (tbase_get_deferrable(nte->base)) > continue; > > - found = 1; > + found = true; > if (time_before(nte->expires, expires)) > expires = nte->expires; > } > @@ -1245,7 +1247,7 @@ cascade: > /* Look at the cascade bucket(s)? */ > if (!index || slot < index) > break; > - return expires; > + goto out; > } > slot = (slot + 1) & TVN_MASK; > } while (slot != index); > @@ -1254,7 +1256,10 @@ cascade: > timer_jiffies += TVN_SIZE - index; > timer_jiffies >>= TVN_BITS; > } > - return expires; > +out: > + if (found) > + *next_timer = expires; > + return found; > } > > /* > @@ -1317,9 +1322,15 @@ unsigned long get_next_timer_interrupt(unsigned long now) > if (cpu_is_offline(smp_processor_id())) > return now + NEXT_TIMER_MAX_DELTA; > spin_lock(&base->lock); > - if (time_before_eq(base->next_timer, base->timer_jiffies)) > - base->next_timer = __next_timer_interrupt(base); > - expires = base->next_timer; > + if (time_before_eq(base->next_timer, base->timer_jiffies)) { > + > + if (__next_timer_interrupt(base, &expires)) > + base->next_timer = expires; > + else > + expires = now + NEXT_TIMER_MAX_DELTA; I believe you can update base->next_timer to now + NEXT_TIMER_MAX_DELTA, so on any further idle interrupt exit that call tick_nohz_stop_sched_tick(), we won't get again the overhead of __next_timer_interrupt(). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>