On Thu, Feb 09, 2017 at 04:07:53PM +0100, Peter Zijlstra wrote: > On Wed, Feb 08, 2017 at 01:29:24PM +0000, Matt Fleming wrote: > > The calculation for the next sample window when exiting NOH_HZ idle > > does not handle the fact that we may not have reached the next sample > > window yet, i.e. that we came out of idle between sample windows. > > > > If we wake from NO_HZ idle after the pending this_rq->calc_load_update > > window time when we want idle but before the next sample window, we > > will add an unnecessary LOAD_FREQ delay to the load average > > accounting, delaying any update for potentially ~9seconds. > > > > This can result in huge spikes in the load average values due to > > per-cpu uninterruptible task counts being out of sync when accumulated > > across all CPUs. > > > > It's safe to update the per-cpu active count if we wake between sample > > windows because any load that we left in 'calc_load_idle' will have > > been zero'd when the idle load was folded in calc_global_load(). > > Right, so differently put; the problem is that we check against the > 'stale' rq->calc_load_update, while the current and effective period > boundary is 'calc_load_update'. > > So, when rq->calc_load_update < jiffies < calc_load_update, we end up > setting the next-update to calc_load_update+LOAD_FREQ, where it should > have been calc_load_update. > > > diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c > > index a2d6eb71f06b..a7a6f3646970 100644 > > --- a/kernel/sched/loadavg.c > > +++ b/kernel/sched/loadavg.c > > > @@ -210,10 +211,16 @@ void calc_load_exit_idle(void) > > * We woke inside or after the sample window, this means we're already > > * accounted through the nohz accounting, so skip the entire deal and > > * sync up for the next window. > > + * > > + * The next window is 'calc_load_update' if we haven't reached it yet, > > + * and 'calc_load_update + 10' if we're inside the current window. Hmm, the comment doesn't seem to match the code. > > */ > > + next_window = calc_load_update; > > + > > + if (time_in_range_open(jiffies, next_window, next_window + 10) > > + next_window += LOAD_FREQ; > > + > > + this_rq->calc_load_update = next_window; > > } > > So I don't much like the time_in_range_open() thing. The simpler patch > which you tested to also work was: > > diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c > index 7296b7308eca..cfb47bd0ee50 100644 > --- a/kernel/sched/loadavg.c > +++ b/kernel/sched/loadavg.c > @@ -201,6 +201,8 @@ void calc_load_exit_idle(void) > { > struct rq *this_rq = this_rq(); > > + this_rq->calc_load_update = calc_load_update; > + > /* > * If we're still before the sample window, we're done. > */ > @@ -212,7 +214,6 @@ void calc_load_exit_idle(void) > * accounted through the nohz accounting, so skip the entire deal and > * sync up for the next window. > */ > - this_rq->calc_load_update = calc_load_update; > if (time_before(jiffies, this_rq->calc_load_update + 10)) > this_rq->calc_load_update += LOAD_FREQ; > } > > But the problem there is that we unconditionally issue that store. Now > I've no idea how much of a problem that is, and it certainly is the > simplest form (+- comments that need updating), so maybe that makes > sense. Well, that version looks fine to me. Thanks.