On 2022/1/5 9:48 AM, Josh Don wrote:
> Hi Cruz,
>
> Could you add a bit more background to help me understand what case
> this patch solves? Is your issue that you want to be able to
> attribute forced idle time to the specific cpus it happens on, or do
> you simply want a more convenient way of summing forced idle without
> iterating your cookie'd tasks and summing the schedstat manually?

Firstly, attributing forced idle time to the specific cpus it happens
on helps us measure the effect of steal_cookie_task(). We found that
steal_cookie_task() sometimes conflicts with load balancing: for
example, a cookie'd task is stolen by steal_cookie_task(), but is soon
migrated to another core by the load balancer. Secondly, a more
convenient way of summing forced idle time, instead of iterating over
the cookie'd tasks, is indeed what we need. In a multi-tenant
scenario, maintaining the list of cookie'd tasks is complex and
iterating over it is costly.

>> @@ -190,6 +202,9 @@ static int show_stat(struct seq_file *p, void *v)
>>  	seq_put_decimal_ull(p, " ", nsec_to_clock_t(steal));
>>  	seq_put_decimal_ull(p, " ", nsec_to_clock_t(guest));
>>  	seq_put_decimal_ull(p, " ", nsec_to_clock_t(guest_nice));
>> +#ifdef CONFIG_SCHED_CORE
>> +	seq_put_decimal_ull(p, " ", nsec_to_clock_t(cookied_forceidle));
>> +#endif
>
> IMO it would be better to always print this stat, otherwise it sets a
> weird precedent for new stats added in the future (much more difficult
> for userspace to reason about which column corresponds with which
> field, since it would depend on kernel config).
>
> Also, did you intend to put this in /proc/stat instead of
> /proc/schedstat (the latter of which would be more attractive to
> prevent calculation of these stats unless schedstat was enabled)?

I'll put this in /proc/schedstat and fix the problem of accounting a
simply idle cpu as forced idle in the next version. Many thanks for
the suggestions.

Best,
Cruz Zhao
>> diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
>> @@ -260,6 +261,21 @@ void __sched_core_account_forceidle(struct rq *rq)
>>
>>  	rq->core->core_forceidle_start = now;
>>
>> +	for_each_cpu(i, smt_mask) {
>> +		rq_i = cpu_rq(i);
>> +		p = rq_i->core_pick ?: rq_i->curr;
>> +
>> +		if (!rq->core->core_cookie)
>> +			continue;
>
> I see this is temporary given your other patch, but just a note that
> if your other patch is dropped, this check can be pulled outside the
> loop.
>
>> +		if (p == rq_i->idle && rq_i->nr_running) {
>> +			cpustat = kcpustat_cpu(i).cpustat;
>> +			cpustat[CPUTIME_COOKIED_FORCEIDLE] += delta;
>> +		}
>> +	}
>
> I don't think this is right. If a cpu was simply idle while some other
> SMT sibling on its core was forced idle, and then a task happens to
> wake on the idle cpu, that cpu will now be charged the full delta here
> as forced idle (when actually it was never forced idle, we just
> haven't been through pick_next_task yet). One workaround would be to
> add a boolean to struct rq to cache whether the rq was in forced idle
> state.
>
> Best,
> Josh