On Tue, 31 Jul 2018 at 00:43, Vincent Guittot <vincent.guittot@xxxxxxxxxx> wrote:
>
> Hi Wanpeng,
>
> On Thu, 26 Jul 2018 at 05:09, Wanpeng Li <kernellwp@xxxxxxxxx> wrote:
> >
> > Hi Vincent,
> > On Fri, 29 Jun 2018 at 03:07, Vincent Guittot
> > <vincent.guittot@xxxxxxxxxx> wrote:
> > >
> > > interrupt and steal time are the only remaining activities tracked by
> > > rt_avg. Like for sched classes, we can use PELT to track their average
> > > utilization of the CPU. But unlike sched classes, we don't track when
> > > entering/leaving interrupt; instead, we take into account the time spent
> > > under interrupt context when we update the rqs' clock (rq_clock_task).
> > > This also means that we have to decay the normal context time and account
> > > for interrupt time during the update.
> > >
> > > It's also important to note that because
> > >     rq_clock == rq_clock_task + interrupt time
> > > and rq_clock_task is used by a sched class to compute its utilization, the
> > > util_avg of a sched class only reflects the utilization of the time spent
> > > in normal context and not of the whole time of the CPU. The utilization of
> > > interrupt gives a more accurate level of utilization of the CPU.
> > > The CPU utilization is:
> > >     avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq
> > >
> > > Most of the time, avg_irq is small and negligible, so the use of the
> > > approximation CPU utilization = /Sum avg_rq was enough.
> > >
> > > Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> > > Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > > Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> > > ---
> > >  kernel/sched/core.c  |  4 +++-
> > >  kernel/sched/fair.c  | 13 ++++++++++---
> > >  kernel/sched/pelt.c  | 40 ++++++++++++++++++++++++++++++++++++++++
> > >  kernel/sched/pelt.h  | 16 ++++++++++++++++
> > >  kernel/sched/sched.h |  3 +++
> > >  5 files changed, 72 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 78d8fac..e5263a4 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -18,6 +18,8 @@
> > >  #include "../workqueue_internal.h"
> > >  #include "../smpboot.h"
> > >
> > > +#include "pelt.h"
> > > +
> > >  #define CREATE_TRACE_POINTS
> > >  #include <trace/events/sched.h>
> > >
> > > @@ -186,7 +188,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
> > >
> > >  #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> > >         if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
> > > -               sched_rt_avg_update(rq, irq_delta + steal);
> > > +               update_irq_load_avg(rq, irq_delta + steal);
> >
> > I think we should not add steal time into irq load tracking; steal
> > time is always 0 on a native kernel, which doesn't matter, but what will
> > happen when a guest disables IRQ_TIME_ACCOUNTING and enables
> > PARAVIRT_TIME_ACCOUNTING? Steal time is not the real irq util_avg. In
> > addition, we haven't exposed power management for performance, which
> > means that e.g. the schedutil governor cannot cooperate with the
> > passive-mode intel_pstate driver to tune the OPP. Decaying the old
> > steal time avg and adding the new one just wastes cpu cycles.
>
> In fact, I have kept the same behavior as with rt_avg, which was
> already adding steal time when computing scale_rt_capacity, which is
> used to reflect the remaining capacity for FAIR tasks and is used in
> load balance. I'm not sure that it's worth using different variables
> for irq and steal.
> That being said, I see a possible optimization in schedutil when
> PARAVIRT_TIME_ACCOUNTING is enabled and IRQ_TIME_ACCOUNTING is disabled.
> With this kind of config, scale_irq_capacity can be a nop for
> schedutil but still scales the utilization for scale_rt_capacity.

Yeah, this is what was in my mind before; you can make a patch for that. :)

Regards,
Wanpeng Li