On Wed, Feb 08, 2017 at 01:29:24PM +0000, Matt Fleming wrote: > The calculation for the next sample window when exiting NOH_HZ idle > does not handle the fact that we may not have reached the next sample > window yet, i.e. that we came out of idle between sample windows. > > If we wake from NO_HZ idle after the pending this_rq->calc_load_update > window time when we want idle but before the next sample window, we > will add an unnecessary LOAD_FREQ delay to the load average > accounting, delaying any update for potentially ~9seconds. > > This can result in huge spikes in the load average values due to > per-cpu uninterruptible task counts being out of sync when accumulated > across all CPUs. > > It's safe to update the per-cpu active count if we wake between sample > windows because any load that we left in 'calc_load_idle' will have > been zero'd when the idle load was folded in calc_global_load(). Right, so differently put; the problem is that we check against the 'stale' rq->calc_load_update, while the current and effective period boundary is 'calc_load_update'. So, when rq->calc_load_update < jiffies < calc_load_update, we end up setting the next-update to calc_load_update+LOAD_FREQ, where it should have been calc_load_update. > diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c > index a2d6eb71f06b..a7a6f3646970 100644 > --- a/kernel/sched/loadavg.c > +++ b/kernel/sched/loadavg.c > @@ -210,10 +211,16 @@ void calc_load_exit_idle(void) > * We woke inside or after the sample window, this means we're already > * accounted through the nohz accounting, so skip the entire deal and > * sync up for the next window. > + * > + * The next window is 'calc_load_update' if we haven't reached it yet, > + * and 'calc_load_update + 10' if we're inside the current window. > */ > + next_window = calc_load_update; > + > + if (time_in_range_open(jiffies, next_window, next_window + 10) > + next_window += LOAD_FREQ; > + > + this_rq->calc_load_update = next_window; > } So I don't much like the time_in_range_open() thing. The simpler patch which you tested to also work was: diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c index 7296b7308eca..cfb47bd0ee50 100644 --- a/kernel/sched/loadavg.c +++ b/kernel/sched/loadavg.c @@ -201,6 +201,8 @@ void calc_load_exit_idle(void) { struct rq *this_rq = this_rq(); + this_rq->calc_load_update = calc_load_update; + /* * If we're still before the sample window, we're done. */ @@ -212,7 +214,6 @@ void calc_load_exit_idle(void) * accounted through the nohz accounting, so skip the entire deal and * sync up for the next window. */ - this_rq->calc_load_update = calc_load_update; if (time_before(jiffies, this_rq->calc_load_update + 10)) this_rq->calc_load_update += LOAD_FREQ; } But the problem there is that we unconditionally issue that store. Now I've no idea how much of a problem that is, and it certainly is the simplest form (+- comments that need updating), so maybe that makes sense. Alternatively, something like: diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c index 7296b7308eca..3dd4ce6fe151 100644 --- a/kernel/sched/loadavg.c +++ b/kernel/sched/loadavg.c @@ -207,12 +207,15 @@ void calc_load_exit_idle(void) if (time_before(jiffies, this_rq->calc_load_update)) return; + this_rq->calc_load_update = calc_load_update; + if (time_before(jiffies, this_rq->calc_load_update)) + return; + /* * We woke inside or after the sample window, this means we're already * accounted through the nohz accounting, so skip the entire deal and * sync up for the next window. */ - this_rq->calc_load_update = calc_load_update; if (time_before(jiffies, this_rq->calc_load_update + 10)) this_rq->calc_load_update += LOAD_FREQ; } might be another solution. Irrespective the above though; should we not make this: + this_rq->calc_load_update = READ_ONCE(calc_load_update); because if for some reason we do a double load of calc_load_update and see two different values, weird stuff could happen. And because, on general principle, a READ_ONCE() should be paired with a WRITE_ONCE(), that should be done too I suppose. -- To unsubscribe from this list: send the line "unsubscribe stable" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html