2017-02-17 20:07 GMT+08:00 Matt Fleming <matt@xxxxxxxxxxxxxxxxxxx>:
> If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to
> the pending sample window time on exit, setting the next update not
> one window into the future, but two.
>
> This situation on exiting NO_HZ is described by:
>
>   this_rq->calc_load_update < jiffies < calc_load_update
>
> In this scenario, what we should be doing is:
>
>   this_rq->calc_load_update = calc_load_update [ next window ]
>
> But what we actually do is:
>
>   this_rq->calc_load_update = calc_load_update + LOAD_FREQ [ next+1 window ]
>
> This has the effect of delaying load average updates for potentially
> up to ~9 seconds.
>
> This can result in huge spikes in the load average values due to
> per-cpu uninterruptible task counts being out of sync when accumulated
> across all CPUs.
>
> It's safe to update the per-cpu active count if we wake between sample
> windows because any load that we left in 'calc_load_idle' will have
> been zero'd when the idle load was folded in calc_global_load().
>
> This issue is easy to reproduce before,
>
>   commit 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
>
> just by forking short-lived process pipelines built from ps(1) and
> grep(1) in a loop. I'm unable to reproduce the spikes after that
> commit, but the bug still seems to be present from code review.
>
> Fixes: commit 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again")
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Mike Galbraith <umgwanakikbuti@xxxxxxxxx>
> Cc: Morten Rasmussen <morten.rasmussen@xxxxxxx>
> Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> Cc: Frederic Weisbecker <fweisbec@xxxxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx> # v3.5+
> Signed-off-by: Matt Fleming <matt@xxxxxxxxxxxxxxxxxxx>

Reviewed-by: Wanpeng Li <wanpeng.li@xxxxxxxxxxx>

> ---
>  kernel/sched/loadavg.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> Changes in v2:
>
>  - Folded in Peter's suggestion for how to fix this.
>
>  - Tried to clarify the changelog based on feedback from Peter and
>    Frederic
>
> diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
> index a2d6eb71f06b..ec91fcc09bfe 100644
> --- a/kernel/sched/loadavg.c
> +++ b/kernel/sched/loadavg.c
> @@ -201,8 +201,9 @@ void calc_load_exit_idle(void)
>  	struct rq *this_rq = this_rq();
>
>  	/*
> -	 * If we're still before the sample window, we're done.
> +	 * If we're still before the pending sample window, we're done.
>  	 */
> +	this_rq->calc_load_update = calc_load_update;
>  	if (time_before(jiffies, this_rq->calc_load_update))
>  		return;
>
> @@ -211,7 +212,6 @@ void calc_load_exit_idle(void)
>  	 * accounted through the nohz accounting, so skip the entire deal and
>  	 * sync up for the next window.
>  	 */
> -	this_rq->calc_load_update = calc_load_update;
>  	if (time_before(jiffies, this_rq->calc_load_update + 10))
>  		this_rq->calc_load_update += LOAD_FREQ;
>  }
> --
> 2.10.0
>
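
For illustration, the effect of moving the assignment can be sketched as a
small user-space model of the two orderings. This is not kernel code: the
function names exit_idle_old() and exit_idle_new(), the plain-integer
parameters standing in for jiffies and the two calc_load_update values, and
HZ = 100 are all assumptions made for the example. Each function returns the
per-rq update time that would result on exiting NO_HZ.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed for the example only; the kernel's HZ is a build-time setting. */
#define HZ        100UL
#define LOAD_FREQ (5 * HZ + 1)

/* User-space stand-in for the kernel's wrap-safe time_before(). */
static bool time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Hypothetical model of the ordering before the patch. */
static unsigned long exit_idle_old(unsigned long jiffies,
				   unsigned long rq_update,
				   unsigned long global_update)
{
	/* The check uses the stale per-rq value. */
	if (time_before(jiffies, rq_update))
		return rq_update;

	rq_update = global_update;
	if (time_before(jiffies, rq_update + 10))
		rq_update += LOAD_FREQ;
	return rq_update;
}

/* Hypothetical model of the ordering after the patch. */
static unsigned long exit_idle_new(unsigned long jiffies,
				   unsigned long rq_update,
				   unsigned long global_update)
{
	/* Sync with the global window first, then check against it. */
	rq_update = global_update;
	if (time_before(jiffies, rq_update))
		return rq_update;

	if (time_before(jiffies, rq_update + 10))
		rq_update += LOAD_FREQ;
	return rq_update;
}

/* Scenarios from the changelog, with made-up jiffies values. */
void check_exit_idle_examples(void)
{
	/* rq window (1000) < jiffies (1200) < global window (1501) */
	assert(exit_idle_old(1200, 1000, 1501) == 1501 + LOAD_FREQ);
	assert(exit_idle_new(1200, 1000, 1501) == 1501);

	/* Waking inside the 10-tick fold window: both skip to the next one. */
	assert(exit_idle_old(1505, 1000, 1501) == 1501 + LOAD_FREQ);
	assert(exit_idle_new(1505, 1000, 1501) == 1501 + LOAD_FREQ);
}
```

In the first scenario the old ordering lands on next+1 (2002 in this model)
while the new ordering lands on the next window (1501); the two agree when the
wakeup falls inside the 10-tick fold window.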