(2014/04/15 19:04), Peter Zijlstra wrote: > On Thu, Apr 10, 2014 at 06:13:54PM +0900, Hidetoshi Seto wrote: >> This patch is v3 of patch set to fix an issue that idle/iowait >> of /proc/stat can go backward. Originally reported by Tetsuo and >> Fernando at last year, Mar 2013. >> >> [BACKGROUNDS]: idle accounting on NO_HZ >> >> If NO_HZ is enabled, cpu stops tick interrupt for itself before >> go sleep to be idle. It means that time stats of the sleeping cpu >> will not be updated in tick interrupt. Instead when cpu wakes up, >> it updates time stats by calculating idle duration time from >> timestamp at entering idle and current time as exiting idle. >> >> OTOH, it can happen that there are some kind of observer who want >> to know how long the sleeping cpu have been idle. Imagine that >> userland runs top command or application who read /proc/stats. >> Therefore kernel provides get_cpu_{idle,iowait}_time_us() function >> for user to obtain current idle time stats of such sleeping cpu. >> This function reads time stats and timestamp at entering idle, >> and then return current idle time by adding duration calculated >> from timestamp and current time. >> >> There are 2 problems: >> >> [PROBLEM 1]: there is no exclusive control. >> >> It is easy to understand that there are 2 different cpu - an >> observing cpu where running a program observing idle cpu's stat >> and an observed cpu where performing idle. It means race can >> happen if observed cpu wakes up while it is observed. Observer >> can accidentally add calculated duration time (say delta) to >> time stats which is just updated by woken cpu. Soon correct >> idle time is returned in next turn, so it will result in >> backward time. Therefore readers must be excluded by writer. >> >> My former patch happily changes get_cpu_{idle,iowait}_time_us() >> not to update sleeping cpu's time stats from observing cpu. >> It makes time stats to be updated by woken cpu only, so there >> are only one writer now! >> >> In summary there are races between one writer and multiple >> reader but no exclusive control on this idle time stats dataset. > > This should've gone in the previous patch; as it describes what that > does. Yes, I've being lazy... I'll fix it. >> >> [PROBLEM 2]: broken iowait accounting. >> >> As historical nature, cpu's idle time was accounted as either >> idle or iowait depending on the presence of tasks blocked by >> I/O. No one complain about it for a long time. However: >> >> > Still trying to wrap my head around it, but conceptually >> > get_cpu_iowait_time_us() doesn't make any kind of sense. >> > iowait isn't per cpu since effectively tasks that aren't >> > running aren't assigned a cpu (as Oleg already pointed out). >> -- Peter Zijlstra >> >> Now some kernel folks realized that accounting iowait as per-cpu >> does not make sense in SMP world. When we were in traditional >> UP era, cpu is considered to be waiting I/O if it is idle while >> nr_iowait > 0. But in these days with SMP systems, tasks easily >> migrate from a cpu where they issued an I/O to another cpu where >> they are queued after I/O completion. >> >> Back to NO_HZ mechanism. Totally terrible thing here is that >> observer need to handle duration "delta" without knowing that >> nr_iowait of sleeping cpu can be changed easily by migration >> even if cpu is sleeping. So it can happen that: >> >> given: >> idle time stats: idle=1000, iowait=100 >> stamp at idle entry: entry=50 >> nr tasks waiting io: nr_iowait=1 >> >> observer temporary assigns delta as iowait at 1st place, >> (but does not do update (=account delta to time stats)): >> 1st reader's query @ now = 60: >> idle=1000 >> iowait=110 (=100+(60-50)) >> >> then blocked tasks are migrated all: >> nr_iowait=0 >> >> and at last in 2nd turn observer assign delta as idle: >> 2nd reader's query @ now = 70: >> idle=1020 (=1000+(70-50)) >> iowait=100 >> >> You will see that iowait is decreased from 110 to 100. >> >> In summary iowait accounting has fundamental problem and needs >> to be precisely reworked. It implies that some user interfaces >> might be replaced completely. It will take long time to be >> solved, so workaround for compatibility will be appreciated. > > This isn't actually true. The way the current ->nr_iowait accounting > works is that we inc/dec against the cpu the task went to sleep on. So > task migration won't actually affect nr_iowait. The only way to > decrement nr_iowait is to wake a task. You are right. I used "migration" because still I had an old thought that "sleeping task is belonging to cpu where it went to sleep on." I'll update description here. > That's of course still complete crap, imagine a task going into iowait > sleep on CPU1 at t0. At t1 it wakes on CPU0, but CPU1 stays in nohz > sleep. Then at t2 CPU1 wakes and updates its sleep times. > > Between t0 and t1 a reader will observe iowait on CPU1 and report iowait > + x; x < t1 - t0. However a reader at >=t1 will observe no iowait and > report the old iowait time again but an increased idle time. Furthermore > when at t2 CPU1 wakes it will observe no iowait and account the entire > duration as idle, the iowait never happened. I don't doubt that "per-cpu iowait is complete crap." However I concern there is no solution yet even though we already have spent about a year after this "counter goes backward" have reported as regression. [continue to your next reply] Thanks, H.Seto -- To unsubscribe from this list: send the line "unsubscribe stable" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html