Re: [RFC 4/7] change kernel accounting to include steal time

Glauber Costa <glommer@xxxxxxxxxx> · Thu, 26 Aug 2010 17:28:56 -0300

On Thu, Aug 26, 2010 at 02:23:03PM -0300, Marcelo Tosatti wrote:
> On Wed, Aug 25, 2010 at 05:43:14PM -0400, Glauber Costa wrote:
> > This patch proposes a common steal time implementation. When no
> > steal time is accounted, we just add a branch to the current
> > accounting code, that shouldn't add much overhead.
> > 
> > When we do want to register steal time, we proceed as following:
> > - if we would account user or system time in this tick, and there is
> >   out-of-cpu time registered, we skip it altogether, and account steal
> >   time only.
> > - if we would account user or system time in this tick, and we got the
> >   cpu for the whole slice, we proceed normaly.
> > - if we are idle in this tick, we flush out-of-cpu time to give it the
> >   chance to update whatever last-measure internal variable it may have.
> 
> Problem of using sched notifiers is that you don't differentiate whether
> the vcpu scheduled out by its own (via hlt emulation) or not.
And we don't need to. If we're out because we want to, we're idle.
And so, we don't account steal time.

> Skipping accounting of user/system time whenever there's any stolen
> time detected probably breaks u/s accounting on non-cpu-hog loads.
I am willing to test some workloads you can suggest, but right now,
(yeah, I mostly used cpu-hogs), this scheme worked better.

Linux does statistical sampling for accounting anyway, so I don't see
it getting much worse.

> 
> I suppose steal time should be accounted separately from u/s ticks, as
> Xen does.
It requires us to hook somewhere else, which I deem as overcomplicated.
Do you have any suggestion on how to make it simple?

Furthermore, "doing separate", is equivalent of not skipping user/system,
if we really prefer to.

> +   if (delta > 1000UL)
> +               touch_softlockup_watchdog();
> +
> 
> This will break authentic soft lockup detection whenever qemu processing
> takes more than 1s.

This should be 10s. 1000UL is a typo.

> 
> > 
> > This approach is simple, but proved to work well for my test scenarios.
> > in a UP guest on UP host, with a cpu-hog in both guest and host shows
> > ~ 50 % steal time. steal time is also accounted proportionally, if
> > nice values are given to the host cpu-hog.
> > 
> > A cpu-hog in the host with no load in the guest, produces 0 % steal time,
> > with 100 % idle, as one would expect.
> > 
> > Signed-off-by: Glauber Costa <glommer@xxxxxxxxxx>
> > ---
> >  include/linux/sched.h |    1 +
> >  kernel/sched.c        |   29 +++++++++++++++++++++++++++++
> >  2 files changed, 30 insertions(+), 0 deletions(-)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 0478888..e571ddd 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -312,6 +312,7 @@ long io_schedule_timeout(long timeout);
> >  extern void cpu_init (void);
> >  extern void trap_init(void);
> >  extern void update_process_times(int user);
> > +extern cputime_t (*hypervisor_steal_time)(void);
> >  extern void scheduler_tick(void);
> >  
> >  extern void sched_show_task(struct task_struct *p);
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index f52a880..9695c92 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -3157,6 +3157,16 @@ unsigned long long thread_group_sched_runtime(struct task_struct *p)
> >  	return ns;
> >  }
> >  
> > +cputime_t (*hypervisor_steal_time)(void) = NULL;
> > +
> > +static inline cputime_t get_steal_time_from_hypervisor(void)
> > +{
> > +	if (!hypervisor_steal_time)
> > +		return 0;
> > +	return hypervisor_steal_time();
> > +}
> > +
> > +
> >  /*
> >   * Account user cpu time to a process.
> >   * @p: the process that the cpu time gets accounted to
> > @@ -3169,6 +3179,12 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
> >  	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
> >  	cputime64_t tmp;
> >  
> > +	tmp = get_steal_time_from_hypervisor();
> > +	if (tmp) {
> > +		account_steal_time(tmp);
> > +		return;
> > +	}
> > +
> >  	/* Add user time to process. */
> >  	p->utime = cputime_add(p->utime, cputime);
> >  	p->utimescaled = cputime_add(p->utimescaled, cputime_scaled);
> > @@ -3234,6 +3250,12 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
> >  		return;
> >  	}
> >  
> > +	tmp = get_steal_time_from_hypervisor();
> > +	if (tmp) {
> > +		account_steal_time(tmp);
> > +		return;
> > +	}
> > +
> >  	/* Add system time to process. */
> >  	p->stime = cputime_add(p->stime, cputime);
> >  	p->stimescaled = cputime_add(p->stimescaled, cputime_scaled);
> > @@ -3276,6 +3298,13 @@ void account_idle_time(cputime_t cputime)
> >  	cputime64_t cputime64 = cputime_to_cputime64(cputime);
> >  	struct rq *rq = this_rq();
> >  
> > +	/*
> > +	 * if we're idle, we don't account it as steal time, since we did
> > +	 * not want to run anyway. We do call the steal function, however, to
> > +	 * give the guest the chance to flush its internal buffers
> > +	 */
> > +	get_steal_time_from_hypervisor();
> > +
> >  	if (atomic_read(&rq->nr_iowait) > 0)
> >  		cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
> >  	else
> > -- 
> > 1.6.2.2
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html