This patch is 2/2 of v4 patch set to fix an issue that idle/iowait of /proc/stat can go backward. Originally reported by Tetsuo and Fernando at last year, Mar 2013. (Continued from previous patch 1/2.) [PROBLEM 2]: broken iowait accounting. As historical nature, cpu's idle time was accounted as either idle or iowait depending on the presence of tasks blocked by I/O. No one complain about it for a long time. However: > Still trying to wrap my head around it, but conceptually > get_cpu_iowait_time_us() doesn't make any kind of sense. > iowait isn't per cpu since effectively tasks that aren't > running aren't assigned a cpu (as Oleg already pointed out). -- Peter Zijlstra Now some kernel folks realized that accounting iowait as per-cpu does not make sense in SMP world. When we were in traditional UP era, cpu is considered to be waiting I/O if it is idle while nr_iowait > 0. But in these days with SMP systems, tasks easily migrate from a cpu where they issued an I/O to another cpu where they are queued after I/O completion. Back to NO_HZ mechanism. Totally terrible thing here is that observer need to handle duration "delta" without knowing that nr_iowait of sleeping cpu can be changed easily by woken tasks even if cpu is sleeping. So it can happen that: given: idle time stats: idle=1000, iowait=100 stamp at idle entry: entry=50 nr tasks waiting io: nr_iowait=1 observer temporary assigns delta as iowait at 1st place, (but does not do update (=account delta to time stats)): 1st reader's query @ now = 60: idle=1000 iowait=110 (=100+(60-50)) then blocked task is queued to runqueue of other active cpu, woken up from io_schedule{,_timeout}() and do atomic_dec() from the remote: nr_iowait=0 and at last in 2nd turn observer assign delta as idle: 2nd reader's query @ now = 70: idle=1020 (=1000+(70-50)) iowait=100 You will see that iowait is decreased from 110 to 100. In summary, the original problem that iowait of /proc/stats can go backward is a kind of hidden bug, while at the same time iowait accounting has fundamental problem and needs to be precisely reworked. [TARGET OF THIS PATCH]: Complete rework for iowait accounting implies that some user interfaces might be replaced completely. It will introduce new counter or so, and kill per-cpu iowait counter which is known to being nonsense. It will take long time to be achieved, considering time to make the problem to a public knowledge, and also time for interface transition. Anyway as long as linux try to be reliable OS, such drastic change must be placed on mainline kernel in development sooner or later. OTOH, drastic change like above is not acceptable for conservative environment, such as stable kernels, some distributor's kernel and so on, due to respect for compatibility. Still these kernel expects per-cpu iowait counter works as well as it might have been. Therefore for such environment, applying a simple patch to fix behavior of counter will be appreciated rather than replacing an over-used interface or changing an existing usage/semantics. This patch targets latter kernels mainly. It does not do too much, but intend to fix the idle stats counters to be monotonic. I believe that mainline kernel should pick up this patch too, because it surely fix a hidden bug and does not intercept upcoming drastic change. Again, in summary, this patch does not do drastic change to cope with problem 2. But it just fix behavior of idle/iowait counter of /proc/stats. [WHAT THIS PATCH PROPOSED]: The main reason of the bug is that observers have no idea to determine the delta to be idle or iowait at the first place. So I introduced delayed iowait accounting to allow observers to assign delta based on status of observed cpu at idle entry. Refer comment in patch for the detail. [References]: First report from Fernando: [RFC] iowait/idle time accounting hiccups in NOHZ kernels https://lkml.org/lkml/2013/3/18/962 Steps to reproduce guided by Tetsuo: https://lkml.org/lkml/2013/4/1/201 1st patchset from Frederic: [PATCH 0/4] nohz: Fix racy sleeptime stats https://lkml.org/lkml/2013/8/8/638 [PATCH RESEND 0/4] nohz: Fix racy sleeptime stats https://lkml.org/lkml/2013/8/16/274 2nd patchset from Frederic: [RFC PATCH 0/5] nohz: Fix racy sleeptime stats v2 https://lkml.org/lkml/2013/10/19/86 My previous patch set: [PATCH 0/2] nohz: fix idle accounting in NO_HZ kernels https://lkml.org/lkml/2014/3/23/256 [PATCH 0/2 v2] nohz: fix idle accounting in NO_HZ kernels https://lkml.org/lkml/2014/3/30/315 [PATCH v3 0/2] nohz: fix idle accounting in NO_HZ kernels https://lkml.org/lkml/2014/4/10/133 v4: rework delayed iowait accounting: check nr_iowait at entry (not to account iowait at out of io boundary) update description/comments to explain more details v3: use seqcount instead of seqlock (achieved by inserting cleanup as former patch) plus introduce delayed iowait accounting v2: update comments and description about problem 2. include fix for minor typo Signed-off-by: Hidetoshi Seto <seto.hidetoshi@xxxxxxxxxxxxxx> Reported-by: Fernando Luis Vazquez Cao <fernando_b1@xxxxxxxxxxxxx> Reported-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> Cc: Frederic Weisbecker <fweisbec@xxxxxxxxx> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx> Cc: Ingo Molnar <mingo@xxxxxxxxxx> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> Cc: Arjan van de Ven <arjan@xxxxxxxxxxxxxxx> Cc: Oleg Nesterov <oleg@xxxxxxxxxx> Cc: Preeti U Murthy <preeti@xxxxxxxxxxxxxxxxxx> Cc: Denys Vlasenko <vda.linux@xxxxxxxxxxxxxx> Cc: <stable@xxxxxxxxxxxxxxx> --- include/linux/tick.h | 1 + kernel/time/tick-sched.c | 122 +++++++++++++++++++++++++++++++++++++++------ 2 files changed, 106 insertions(+), 17 deletions(-) diff --git a/include/linux/tick.h b/include/linux/tick.h index 00dd98d..cec32e4 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -68,6 +68,7 @@ struct tick_sched { ktime_t idle_exittime; ktime_t idle_sleeptime; ktime_t iowait_sleeptime; + ktime_t iowait_pending; ktime_t sleep_length; unsigned long last_jiffies; unsigned long next_jiffies; diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index c8314a1..95e18bd 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -405,16 +405,90 @@ static void tick_nohz_update_jiffies(ktime_t now) static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now) { + static const ktime_t ktime_zero = { .tv64 = 0 }; ktime_t delta; write_seqcount_begin(&ts->idle_sleeptime_seq); /* Updates the per cpu time idle statistics counters */ delta = ktime_sub(now, ts->idle_entrytime); - if (nr_iowait_cpu(smp_processor_id()) > 0) - ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta); - else + + if (ts->idle_active == 1) { + /* delta is idle */ ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta); + } else { + /* + * delta is sum of idle/iowait: + * + * We account sleep time as iowait when nr_iowait of cpu + * indicates there are tasks blocked by io, at the end of + * idle (=here). + * It means observers (=who query sleeping cpu's idle stats) + * can not determine whether the unaccounted sleep time will + * be idle or iowait on the fly. + * + * Therefore introduce following mechanism: + * + * [PHASE_1]: assign delta to idle + * - pending is 0 + * - observers assign delta to idle + * - when cpu wakes up and come here: + * - If nr_iowait is 0: + * Account delta as idle, and continue PHASE_1. + * - If nr_iowait is >0: + * We'd like to account delta to iowait but it will + * be inconsistent with observers who already assigned + * a part of delta as idle. So here we account delta + * to idle and also pending as missed iowait. Then + * move to PHASE_2 from next turn. + * + * [PHASE_2]: we have missed iowait + * - pending is >0 + * - observers assign delta to iowait until pending and over + * to idle. + * - when cpu wakes up and come here: + * - Account delta to iowait until pending and over to idle + * (=in same manner as observers). + * - Subtract accounted iowait from pending. + * - If nr_iowait is >0: + * The sleep time just finished was used to account + * pending iowait, so now we got new missed iowait. + * Accumulate delta as missed iowait, and wait next + * turn to let it be accounted. Continue PHASE_2. + * - If nr_iowait is 0: + * The sleep time just finished was considered as idle + * but utilized to account missed iowait. It is not + * problem because we just exchange excess idle for + * missed iowait. If we still have pending continue + * PHASE_2, otherwise move to PHASE_1 from next turn. + */ + ktime_t *idle = &ts->idle_sleeptime; + ktime_t *iowait = &ts->iowait_sleeptime; + ktime_t *pending = &ts->iowait_pending; + + if (ktime_compare(*pending, ktime_zero) == 0) { + /* PHASE_1 */ + *idle = ktime_add(*idle, delta); + if (nr_iowait_cpu(smp_processor_id()) > 0) + *pending = delta; + } else { + /* PHASE_2 */ + if (ktime_compare(*pending, delta) > 0) { + *iowait = ktime_add(*iowait, delta); + if (!nr_iowait_cpu(smp_processor_id())) + *pending = ktime_sub(*pending, delta); + } else { + *idle = ktime_add(*idle, ktime_sub(delta, + *pending)); + *iowait = ktime_add(*iowait, *pending); + if (nr_iowait_cpu(smp_processor_id()) > 0) + *pending = delta; + else + *pending = ktime_zero; + } + } + } + ts->idle_entrytime = now; ts->idle_active = 0; @@ -429,7 +503,13 @@ static ktime_t tick_nohz_start_idle(struct tick_sched *ts) write_seqcount_begin(&ts->idle_sleeptime_seq); ts->idle_entrytime = now; - ts->idle_active = 1; + /* + * idle_active: + * 0: cpu is not idle + * 1: cpu is performing idle + * 2: cpu is performing iowait and idle + */ + ts->idle_active = nr_iowait_cpu(smp_processor_id()) ? 2 : 1; write_seqcount_end(&ts->idle_sleeptime_seq); sched_clock_idle_sleep_event(); @@ -464,18 +544,23 @@ u64 get_cpu_idle_time_us(int cpu, u64 *wall) do { seq = read_seqcount_begin(&ts->idle_sleeptime_seq); - - if (ts->idle_active && !nr_iowait_cpu(cpu)) { - ktime_t delta = ktime_sub(now, ts->idle_entrytime); + idle = ts->idle_sleeptime; - idle = ktime_add(ts->idle_sleeptime, delta); - } else { - idle = ts->idle_sleeptime; + if (ts->idle_active == 2) { + /* delta is sum of unaccounted idle/iowait */ + ktime_t delta = ktime_sub(now, ts->idle_entrytime); + if (ktime_compare(delta, ts->iowait_pending) > 0) { + delta = ktime_sub(delta, ts->iowait_pending); + idle = ktime_add(idle, delta); + } + } else if (ts->idle_active == 1) { + /* delta is unaccounted idle */ + ktime_t delta = ktime_sub(now, ts->idle_entrytime); + idle = ktime_add(idle, delta); } } while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq)); return ktime_to_us(idle); - } EXPORT_SYMBOL_GPL(get_cpu_idle_time_us); @@ -507,13 +592,16 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *wall) do { seq = read_seqcount_begin(&ts->idle_sleeptime_seq); - - if (ts->idle_active && nr_iowait_cpu(cpu) > 0) { - ktime_t delta = ktime_sub(now, ts->idle_entrytime); + iowait = ts->iowait_sleeptime; - iowait = ktime_add(ts->iowait_sleeptime, delta); - } else { - iowait = ts->iowait_sleeptime; + if (ts->idle_active == 2) { + /* delta is sum of unaccounted idle/iowait */ + ktime_t delta = ktime_sub(now, ts->idle_entrytime); + if (ktime_compare(delta, ts->iowait_pending) > 0) { + iowait = ktime_add(iowait, ts->iowait_pending); + } else { + iowait = ktime_add(iowait, delta); + } } } while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq)); -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe stable" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html