On 2022/1/5 9:48 AM, Josh Don wrote:
> Hi Cruz,
>
> Could you add a bit more background to help me understand what case
> this patch solves? Is your issue that you want to be able to
> attribute forced idle time to the specific cpus it happens on, or do
> you simply want a more convenient way of summing forced idle without
> iterating your cookie'd tasks and summing the schedstat manually?

Firstly, attributing forced idle time to the specific cpus it happens
on helps us measure the effect of steal_cookie_task(). We found that
steal_cookie_task() sometimes conflicts with load balancing: for
example, a cookie'd task is stolen by steal_cookie_task(), but is soon
migrated to another core by the load balancer. Secondly, a more
convenient way of summing forced idle time, instead of iterating over
the cookie'd tasks, is indeed what we need. In a multi-tenant
scenario, maintaining the list of cookie'd tasks is complex and
iterating over it is costly.

>> @@ -190,6 +202,9 @@ static int show_stat(struct seq_file *p, void *v)
>>  	seq_put_decimal_ull(p, " ", nsec_to_clock_t(steal));
>>  	seq_put_decimal_ull(p, " ", nsec_to_clock_t(guest));
>>  	seq_put_decimal_ull(p, " ", nsec_to_clock_t(guest_nice));
>> +#ifdef CONFIG_SCHED_CORE
>> +	seq_put_decimal_ull(p, " ", nsec_to_clock_t(cookied_forceidle));
>> +#endif
>
> IMO it would be better to always print this stat, otherwise it sets a
> weird precedent for new stats added in the future (much more difficult
> for userspace to reason about which column corresponds with which
> field, since it would depend on kernel config).
>
> Also, did you intend to put this in /proc/stat instead of
> /proc/schedstat (the latter of which would be more attractive to
> prevent calculation of these stats unless schedstat was enabled)?

I'll put this in /proc/schedstat and fix the problem of accounting a
simply idle cpu as forced idle in the next version. Many thanks for
the suggestions.

Best,
Cruz Zhao
>> diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
>> @@ -260,6 +261,21 @@ void __sched_core_account_forceidle(struct rq *rq)
>>
>>  	rq->core->core_forceidle_start = now;
>>
>> +	for_each_cpu(i, smt_mask) {
>> +		rq_i = cpu_rq(i);
>> +		p = rq_i->core_pick ?: rq_i->curr;
>> +
>> +		if (!rq->core->core_cookie)
>> +			continue;
>
> I see this is temporary given your other patch, but just a note that
> if your other patch is dropped, this check can be pulled outside the
> loop.
>
>> +		if (p == rq_i->idle && rq_i->nr_running) {
>> +			cpustat = kcpustat_cpu(i).cpustat;
>> +			cpustat[CPUTIME_COOKIED_FORCEIDLE] += delta;
>> +		}
>> +	}
>
> I don't think this is right. If a cpu was simply idle while some other
> SMT sibling on its core was forced idle, and then a task happens to
> wake on the idle cpu, that cpu will now be charged the full delta here
> as forced idle (when actually it was never forced idle, we just
> haven't been through pick_next_task yet). One workaround would be to
> add a boolean to struct rq to cache whether the rq was in forced idle
> state.
>
> Best,
> Josh