On Wed, Feb 01, 2023 at 12:53:02PM +0800, Hillf Danton wrote:
> On Tue, 31 Jan 2023 15:44:00 +0100 Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> >
> > Seriously this procfs accuracy is the least of the problems and if this
> > would be the only issue then we could trivially fix it by declaring that
> > the procfs output might go backwards. It's an estimate after all. If
> > there would be a real reason to ensure monotonicity there then we could
> > easily do that in the readout code.
> >
> > But the real issue is that both get_cpu_idle_time_us() and
> > get_cpu_iowait_time_us() can invoke update_ts_time_stats() which is way
> > worse than the above procfs idle time going backwards.
> >
> > If update_ts_time_stats() is invoked concurrently for the same CPU then
> > ts->idle_sleeptime and ts->iowait_sleeptime are turning into random
> > numbers.
> >
> > This has been broken 12 years ago in commit 595aac488b54 ("sched:
> > Introduce a function to update the idle statistics").
> >
> > [...]
> >
> > > P.S.: I hate the spinlock in the idle code path, but I don't have a
> > better idea.
>
> Provided the percpu rule is enforced, the random numbers mentioned above
> could be erased without another spinlock added.
>
> Hillf
> +++ b/kernel/time/tick-sched.c
> @@ -640,13 +640,26 @@ static void tick_nohz_update_jiffies(kti
>  /*
>   * Updates the per-CPU time idle statistics counters
>   */
> -static void
> -update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
> +static u64 update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now,
> +				int io, u64 *last_update_time)
>  {
>  	ktime_t delta;
>
> +	if (last_update_time)
> +		*last_update_time = ktime_to_us(now);
> +
>  	if (ts->idle_active) {
>  		delta = ktime_sub(now, ts->idle_entrytime);
> +
> +		/* update is only expected on the local CPU */
> +		if (cpu != smp_processor_id()) {

Why not just update it only on idle exit then?

> +			if (io)

I fear it's not up to the caller to decide whether the idle time is IO or not.

> +				delta = ktime_add(ts->iowait_sleeptime, delta);
> +			else
> +				delta = ktime_add(ts->idle_sleeptime, delta);
> +			return ktime_to_us(delta);
> +		}
> +
>  		if (nr_iowait_cpu(cpu) > 0)
>  			ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
>  		else

But you kept the old update above. So if this is not the local CPU, what do
you do?

You'd need to return (without updating iowait_sleeptime):

      ts->idle_sleeptime + ktime_sub(ktime_get(), ts->idle_entrytime)

Right? But then you may race with the local updater and risk returning the
delta added twice. So you need at least a seqcount.

But in the end, nr_iowait_cpu() is broken because that counter can be
decremented remotely, so the whole thing is beyond repair:

CPU 0                       CPU 1                           CPU 2
-----                       -----                           -----
//io_schedule() TASK A
current->in_iowait = 1
rq(0)->nr_iowait++
//switch to idle
                            // READ /proc/stat
                            // See nr_iowait_cpu(0) == 1
                            return ts->iowait_sleeptime +
                                   ktime_sub(ktime_get(), ts->idle_entrytime)

                                                            //try_to_wake_up(TASK A)
                                                            rq(0)->nr_iowait--
//idle exit
// See nr_iowait_cpu(0) == 0
ts->idle_sleeptime += ktime_sub(ktime_get(), ts->idle_entrytime)

Thanks.
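
To make the nr_iowait interleaving above concrete, here is a minimal
userspace sketch in C that replays it deterministically. It is not kernel
code: struct model_ts, struct model_rq, read_iowait_us(), idle_exit() and
replay_race() are hypothetical names made up for illustration, times are
plain microsecond integers with arbitrary values, and the three CPUs are
replayed sequentially in a single thread. It assumes the remote reader
classifies the in-flight idle period by the current nr_iowait value, as in
the diagram.

```c
#include <stdint.h>

/* Hypothetical userspace model of the per-CPU idle accounting state. */
struct model_ts {
	uint64_t idle_entrytime;   /* when the CPU entered idle, in us */
	uint64_t idle_sleeptime;   /* accumulated plain idle time, in us */
	uint64_t iowait_sleeptime; /* accumulated iowait idle time, in us */
	int idle_active;
};

/* Hypothetical stand-in for the runqueue's iowait task counter. */
struct model_rq {
	int nr_iowait;
};

struct replay_result {
	uint64_t iowait_seen_by_reader; /* what CPU 1 reports mid-idle */
	uint64_t iowait_after_exit;     /* what a later read reports */
	uint64_t idle_after_exit;
};

/* Remote reader: add the in-flight delta only if nr_iowait > 0 right now. */
static uint64_t read_iowait_us(const struct model_ts *ts,
			       const struct model_rq *rq, uint64_t now)
{
	if (ts->idle_active && rq->nr_iowait > 0)
		return ts->iowait_sleeptime + (now - ts->idle_entrytime);
	return ts->iowait_sleeptime;
}

/* Local idle exit: account the whole period by nr_iowait at exit time. */
static void idle_exit(struct model_ts *ts, const struct model_rq *rq,
		      uint64_t now)
{
	uint64_t delta = now - ts->idle_entrytime;

	if (rq->nr_iowait > 0)
		ts->iowait_sleeptime += delta;
	else
		ts->idle_sleeptime += delta;
	ts->idle_active = 0;
}

/* Replays the CPU 0 / CPU 1 / CPU 2 sequence from the diagram in order. */
static struct replay_result replay_race(void)
{
	struct model_ts ts = { 0 };
	struct model_rq rq = { 0 };
	struct replay_result res;

	/* CPU 0: task A calls io_schedule(), then the CPU goes idle at t=100 */
	rq.nr_iowait++;
	ts.idle_entrytime = 100;
	ts.idle_active = 1;

	/* CPU 1: reads the stats at t=150 and sees nr_iowait == 1 */
	res.iowait_seen_by_reader = read_iowait_us(&ts, &rq, 150);

	/* CPU 2: wakes task A, decrementing CPU 0's counter remotely */
	rq.nr_iowait--;

	/* CPU 0: idle exit at t=200 now sees nr_iowait == 0 */
	idle_exit(&ts, &rq, 200);

	/* A later read at t=250 */
	res.iowait_after_exit = read_iowait_us(&ts, &rq, 250);
	res.idle_after_exit = ts.idle_sleeptime;
	return res;
}
```

With these made-up timestamps the reader first reports 50us of iowait for
CPU 0, yet at idle exit the whole 100us lands in idle_sleeptime and a later
read reports 0us of iowait. The reported value goes backwards even though
nothing here updates the counters concurrently, so a seqcount around the
update alone would not fix it.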