(2014/04/15 19:19), Peter Zijlstra wrote:
> On Thu, Apr 10, 2014 at 06:13:54PM +0900, Hidetoshi Seto wrote:
>> [WHAT THIS PATCH PROPOSED]:
>>
>> To fix problem 1, this patch adds a seqcount for NO_HZ idle
>> accounting, to avoid possible races between readers and writers.
>>
>> And to cope with problem 2, I introduced delayed iowait
>> accounting to get an approximate value without turning observers
>> into writers. Refer to the comment in the patch for details.
>
>> --- a/kernel/time/tick-sched.c
>> +++ b/kernel/time/tick-sched.c
>> @@ -407,15 +407,42 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
>>  {
>>  	ktime_t delta;
>>
>> +	write_seqcount_begin(&ts->idle_sleeptime_seq);
>> +
>>  	/* Updates the per cpu time idle statistics counters */
>>  	delta = ktime_sub(now, ts->idle_entrytime);
>> +
>> +	/*
>> +	 * Perform delayed iowait accounting:
>> +	 *
>> +	 * We account sleep time as iowait when nr_iowait of the cpu
>> +	 * indicates there are tasks blocked on io, at the end of idle
>> +	 * (=here). It means we cannot determine whether the sleep time
>> +	 * will be idle or iowait on the fly.
>> +	 * Therefore introduce a new rule:
>> +	 *  - basically, observers assign delta to idle
>> +	 *  - if the cpu finds nr_iowait>0 at idle exit, accumulate delta
>> +	 *    as missed iowait, and account it in the next turn of sleep
>> +	 *    instead.
>> +	 *  - if an observer finds accumulated iowait while the cpu is
>> +	 *    asleep, it can calculate the proper value to be accounted.
>> +	 */
>> +	if (ktime_compare(ts->iowait_pending, delta) > 0) {
>>  		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
>> +		ts->iowait_pending = ktime_sub(ts->iowait_pending, delta);
>> +	} else {
>> +		ts->idle_sleeptime = ktime_add(ts->idle_sleeptime,
>> +					       ktime_sub(delta, ts->iowait_pending));
>> +		ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime,
>> +						 ts->iowait_pending);
>> +		ts->iowait_pending = ktime_set(0, 0);
>> +	}
>> +	if (nr_iowait_cpu(smp_processor_id()) > 0)
>> +		ts->iowait_pending = ktime_add(ts->iowait_pending, delta);
>> +
>>  	ts->idle_active = 0;
>>
>> +	write_seqcount_end(&ts->idle_sleeptime_seq);
>> +
>>  	sched_clock_idle_wakeup_event(0);
>>  }
>
> Why!? Both changelog and comment are silent on this. This doesn't appear
> to make any sense nor really solve anything.

Sorry about my poor description. I hope I can clarify my idea and
thoughts in what follows...

[1] : should we change a /proc/stat field's semantics?

As Frederic stated in a previous mail:

<quote>
> So what we can do for example is to account it per task and update stats
> on the CPU where the blocking task wakes up. This way we guarantee
> that we only account locally, which is good for scalability.
>
> This is going to be an ABI change on a /proc/stat field semantic.
> We usually can not do that as it can break userspace. But I think
> we have a reasonable exception here:
>
> 1) On a performance POV we don't have the choice.
>
> 2) It has always been a buggy stat on SMP. Given the possible fast iowait update
> rate, I doubt it has ever dumped correct stats. So I guess that few user apps
> have ever correctly relied on it.
</quote>

Basically I agree with this idea if we maintain only the latest upstream
in development. But if the target kernel is one of the stable trees, or
some kind of distributor's kernel, I suppose this idea is not acceptable,
because keeping semantics is very important for such environments.
For example, you might propose: "hey, per-cpu iowait is complete crap!
So how about making this field in /proc/stat stick at 0?" That would be
OK for the latest upstream, as an interim measure until a new iowait
accounting mechanism is invented. But for stable kernels it would bring
new regression reports, so it would not be any help.

So we need 2 operations:
 a) remove the regression
 b) implement a new iowait accounting mechanism

What Frederic mentioned is that we don't need a) once we invent the
solution for b). But I doubt that, because a) is still required for
stable environments, including some distributors' kernels. It is clear
that patches for b) will not be backportable. Moreover, b) is still a
disease with no known cure. There is no reason to wait for the work on
b) before starting the work on a).

If we need a semantics change, we can do it for the latest upstream.
But OTOH regressions on stable kernels must be fixed. That's why I
wrote this patch set.

[2] : minimum locking for performance

To keep the semantics and emulate the behavior of tick-based accounting,
my v2 patch set used a seqlock to allow observers to update the time
stats of a sleeping cpu. I believe v2 implements the bare minimum.
However, there were concerns about the performance impact of the new
seqlock. So I started v3 as a rework to use minimal locking.

What is required here:

 - Monotonicity:
   It must hold; its violation is what we received complaints about.
 - Accuracy:
   No; it has been really inaccurate from the start. What is expected
   is "roughly the same as traditional tick-based accounting."
 - Exclusivity:
   Expected. A sleep duration must be accounted as either idle or
   iowait, not both.
 - Performance:
   Even if per-cpu iowait were gone, idle time would still need to be
   calculated on the observing cpu from the delta plus the accumulated
   value, besides the update performed locally at idle exit. Therefore
   the minimum locking here should be a seqcount (as Frederic already
   expected).

Therefore v3 is designed to use a seqcount instead of a seqlock, by
sacrificing accuracy (or, say, reasonableness)
to some degree.

[3] : new tricks

To use a seqcount, observers must be readers and never writers. That
means:

 - The observed cpu's time stats are fixed at idle entry and unchanged
   while it sleeps (otherwise the results seen by readers will not be
   coherent).
 - Readers must not refer to nr_iowait of the sleeping cpu, because it
   can be changed by a task woken up on another cpu.

At this point:

 - As already pointed out, stating "I'll sleep as iowait" at idle entry
   can result in infinite iowait.
   => Then how about stating "I'll sleep for <e.g. a few nsecs> as
      iowait and the rest as idle"?
   => How do we determine a reasonable <few nsecs>?
 - The original code accounts iowait only when nr_iowait is >0 at idle
   exit. That means we cannot determine on the fly whether the sleep
   time will be idle or iowait.
   => We cannot determine <few nsecs> at idle entry.
 - No matter how long nr_iowait stays >0 (e.g. a 100ms sleep with
   nr_iowait>0 until 99ms), we ignore iowait in the sleep time if
   nr_iowait=0 at the end.
   => If we assign a temporary, arbitrary <few nsecs> at every idle
      entry, it will bump up iowait unreasonably.
 - It is complete crap, but in fact no one complains about the
   inaccuracy.
   => No one will notice if the accounted iowait values have <few nsec>
      gaps, even if they were <few msec>.

So the new strategy is:

>> +	 * Therefore introduce a new rule:
>> +	 *  - basically, observers assign delta to idle
>> +	 *  - if the cpu finds nr_iowait>0 at idle exit, accumulate delta
>> +	 *    as missed iowait, and account it in the next turn of sleep
>> +	 *    instead.
>> +	 *  - if an observer finds accumulated iowait while the cpu is
>> +	 *    asleep, it can calculate the proper value to be accounted.

... I think I wrote a lot but may have missed some points that need to
be made clear. Questions?

I'll write v4 while waiting for new replies.


Thanks,
H.Seto

--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html