On Wed, Feb 01, 2023 at 10:01:17PM +0800, Hillf Danton wrote:
> > > +++ b/kernel/time/tick-sched.c
> > > @@ -640,13 +640,26 @@ static void tick_nohz_update_jiffies(kti
> > >  /*
> > >   * Updates the per-CPU time idle statistics counters
> > >   */
> > > -static void
> > > -update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
> > > +static u64 update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now,
> > > +				int io, u64 *last_update_time)
> > >  {
> > >  	ktime_t delta;
> > >
> > > +	if (last_update_time)
> > > +		*last_update_time = ktime_to_us(now);
> > > +
> > >  	if (ts->idle_active) {
> > >  		delta = ktime_sub(now, ts->idle_entrytime);
> > > +
> > > +		/* update is only expected on the local CPU */
> > > +		if (cpu != smp_processor_id()) {
> >
> > Why not just updating it only on idle exit then?
>
> This aligns to idle exit as much as it can by disallowing remote update.

I mean why bother updating if idle exit does it for us already? One
possibility is that we get some more precise values if we read during long
idle periods with nr_iowait_cpu() changes in the middle.

> > > +			if (io)
> >
> > I fear it's not up to the caller to decide if the idle time is IO or not.
>
> Could you specify a bit on your concern, given the callers of this function?

You are randomly stating if the elapsing idle time is IO or not depending on
the caller, without verifying nr_iowait_cpu(). Or am I missing something?

> > > +				delta = ktime_add(ts->iowait_sleeptime, delta);
> > > +			else
> > > +				delta = ktime_add(ts->idle_sleeptime, delta);
> > > +			return ktime_to_us(delta);
>
> Based on the above comments, I guess you missed this line which prevents
> get_cpu_idle_time_us() and get_cpu_iowait_time_us() from updating ts.

Right...

> > But then you may race with the local updater, risking to return
> > the delta added twice. So you need at least a seqcount.
>
> Add seqcount if needed. No problem.
> > But in the end, nr_iowait_cpu() is broken because that counter can be
> > decremented remotely and so the whole thing is beyond repair:
> >
> >     CPU 0                    CPU 1                    CPU 2
> >     -----                    -----                    ------
> > //io_schedule() TASK A
> > current->in_iowait = 1
> > rq(0)->nr_iowait++
> > //switch to idle
> >                     // READ /proc/stat
> >                     // See nr_iowait_cpu(0) == 1
> >                     return ts->iowait_sleeptime + ktime_sub(ktime_get(), ts->idle_entrytime)
> >
> >                                              //try_to_wake_up(TASK A)
> >                                              rq(0)->nr_iowait--
> > //idle exit
> > // See nr_iowait_cpu(0) == 0
> > ts->idle_sleeptime += ktime_sub(ktime_get(), ts->idle_entrytime)
>
> Ah see your point.
>
> The diff disallows remotely updating ts, and it is updated in idle exit
> after my proposal, so what nr_iowait_cpu() breaks is mitigated.

Only halfway mitigated. This doesn't prevent backward or forward jumps when
non-updating readers are involved at all.

Thanks.

> > Thanks for taking a look, particularly the race linked to nr_iowait_cpu().
>
> Hillf