This is a note to let you know that I've just added the patch titled sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting to the 4.9-stable tree which can be found at: http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary The filename of the patch is: sched-loadavg-avoid-loadavg-spikes-caused-by-delayed-no_hz-accounting.patch and it can be found in the queue-4.9 subdirectory. If you, or anyone else, feels it should not be added to the stable tree, please let <stable@xxxxxxxxxxxxxxx> know about it. >From 6e5f32f7a43f45ee55c401c0b9585eb01f9629a8 Mon Sep 17 00:00:00 2001 From: Matt Fleming <matt@xxxxxxxxxxxxxxxxxxx> Date: Fri, 17 Feb 2017 12:07:30 +0000 Subject: sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting From: Matt Fleming <matt@xxxxxxxxxxxxxxxxxxx> commit 6e5f32f7a43f45ee55c401c0b9585eb01f9629a8 upstream. If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to the pending sample window time on exit, setting the next update not one window into the future, but two. This situation on exiting NO_HZ is described by: this_rq->calc_load_update < jiffies < calc_load_update In this scenario, what we should be doing is: this_rq->calc_load_update = calc_load_update [ next window ] But what we actually do is: this_rq->calc_load_update = calc_load_update + LOAD_FREQ [ next+1 window ] This has the effect of delaying load average updates for potentially up to ~9seconds. This can result in huge spikes in the load average values due to per-cpu uninterruptible task counts being out of sync when accumulated across all CPUs. It's safe to update the per-cpu active count if we wake between sample windows because any load that we left in 'calc_load_idle' will have been zero'd when the idle load was folded in calc_global_load(). This issue is easy to reproduce before, commit 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking") just by forking short-lived process pipelines built from ps(1) and grep(1) in a loop. I'm unable to reproduce the spikes after that commit, but the bug still seems to be present from code review. Signed-off-by: Matt Fleming <matt@xxxxxxxxxxxxxxxxxxx> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx> Cc: Frederic Weisbecker <fweisbec@xxxxxxxxx> Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> Cc: Mike Galbraith <efault@xxxxxx> Cc: Mike Galbraith <umgwanakikbuti@xxxxxxxxx> Cc: Morten Rasmussen <morten.rasmussen@xxxxxxx> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx> Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx> Fixes: commit 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again") Link: http://lkml.kernel.org/r/20170217120731.11868-2-matt@xxxxxxxxxxxxxxxxxxx Signed-off-by: Ingo Molnar <mingo@xxxxxxxxxx> Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx> --- kernel/sched/loadavg.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) --- a/kernel/sched/loadavg.c +++ b/kernel/sched/loadavg.c @@ -201,8 +201,9 @@ void calc_load_exit_idle(void) struct rq *this_rq = this_rq(); /* - * If we're still before the sample window, we're done. + * If we're still before the pending sample window, we're done. */ + this_rq->calc_load_update = calc_load_update; if (time_before(jiffies, this_rq->calc_load_update)) return; @@ -211,7 +212,6 @@ void calc_load_exit_idle(void) * accounted through the nohz accounting, so skip the entire deal and * sync up for the next window. */ - this_rq->calc_load_update = calc_load_update; if (time_before(jiffies, this_rq->calc_load_update + 10)) this_rq->calc_load_update += LOAD_FREQ; } Patches currently in stable-queue which might be from matt@xxxxxxxxxxxxxxxxxxx are queue-4.9/sched-loadavg-avoid-loadavg-spikes-caused-by-delayed-no_hz-accounting.patch