Re: [PATCH 0/4] Alter steal-time reporting in the guest

Marcelo Tosatti <mtosatti@xxxxxxxxxx> · Thu, 7 Mar 2013 23:21:14 -0300

On Thu, Mar 07, 2013 at 10:54:37PM -0300, Marcelo Tosatti wrote:
> On Thu, Mar 07, 2013 at 04:34:16PM -0600, Michael Wolf wrote:
> > On Thu, 2013-03-07 at 18:25 -0300, Marcelo Tosatti wrote:
> > > On Thu, Mar 07, 2013 at 03:15:09PM -0600, Michael Wolf wrote:
> > > > > 
> > > > > Makes sense?
> > > > > 
> > > > > Not sure what the concrete way to report stolen time relative to hard
> > > > > capping is (probably easier inside the scheduler, where run_delay is
> > > > > calculated).
> > > > > 
> > > > > Reporting the hard capping to the guest is a good idea (which saves the
> > > > > user from having to measure it themselves), but better done separately
> > > > > via new field.
> > > > 
> > > > I didn't respond to this in the previous response.  I'm not sure I'm
> > > > following you here.  I thought this is what I was doing by having a
> > > > consigned (expected steal) field added to the /proc/stat output.  Are you
> > > > looking for something else or a better naming convention?
> > > 
> > > Expected steal is not a good measure to use (because as mentioned in the
> > > previous email there is no expected steal over a fixed period of time).
> > > 
> > > It is fine to report 'maximum percentage of underlying physical CPU'
> > > (what percentage of the physical CPU time guest VM is allowed to make
> > > use of).
> > > 
> > > And then steal time is relative to maximum percentage of underlying
> > > physical CPU time allowed.
> > > 
> > 
> > So last August I had sent out an RFC set of patches to do this.  That
> > patchset was meant to handle the general overcommit case as well as the
> > capping case by having qemu pass a percentage to the host that would
> > then be passed onto the guest and used to adjust the steal time.
> > Here is the link to the discussion
> > http://lkml.indiana.edu/hypermail/linux/kernel/1208.3/01458.html
> > 
> > As you will see there Avi didn't like the idea of a percentage down in
> > the guest, among other reasons he was concerned about migration.  

OK.

> > Also in the email thread you will see that Anthony Liguori was
> > opposed to the idea of just changing the steal time, he wanted it
> > split out.

"What I had previously suggested was splitting entitlement loss out of
steal time and reporting it as a separate metric (but not reporting a
fixed notion of entitlement).

You're missing the entitlement loss bit above. But you need to call
out entitlement loss in order to report idle time correctly.

I think changing steal time (as this patch does) is wrong.

Regards,

Anthony Liguori"

This is what is suggested below. What you mentioned earlier:

"So in this case each guest will have time on the runqueue but neither
will ever be throttled since they will not exceed their quota in the
defined period.  So now just trying to do this in the scheduler doesn't
work because you cannot rely on the throttled flag.  In either case the
time is accumulated as time on the runqueue.

This is why my patchset had included a timer.  It was basically
mimicking the bandwidth controller by using a timer set to the same
period.  So in a given period of time a fixed quota of time on the
runqueue can be expected.  If the amount of time on the runqueue exceeds
the expected, then report it."

Understood, but it's problematic: it is possible for a vcpu to be
deprived of cycles even if it did not exceed its quota. Did you
investigate whether it's possible to split run_delay?
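
To make the objection concrete, here is a toy model of a timer-based
split. It is only a sketch of my reading of the approach, not the code
from the patchset: the function name, the parameters and the rule
"expected delay = period - quota" are all assumptions made for
illustration.

```python
def timer_based_split(run_delay_ms, period_ms, quota_ms):
    """Toy model (hypothetical, not the patchset's code).

    Assumes the delay "expected" in one period is whatever the hard cap
    alone could impose, i.e. period - quota. Delay up to that amount is
    reported as consigned, the remainder as steal.
    """
    expected_ms = period_ms - quota_ms
    consigned_ms = min(run_delay_ms, expected_ms)
    steal_ms = run_delay_ms - consigned_ms
    return steal_ms, consigned_ms


# One 100 ms period with a 50 ms quota. The vcpu ran for 20 ms, was never
# throttled, and waited 30 ms behind other runnable tasks.
steal, consigned = timer_based_split(run_delay_ms=30, period_ms=100, quota_ms=50)
print(steal, consigned)
```

In this model all 30 ms of delay is reported as consigned and none as
steal, although the cap was never hit and the delay was caused entirely
by contention.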

> > What Glauber has suggested and I am working on implementing is taking
> > out the timer and adding a last read field in the host.  So in the host
> > I can determine the total time that has passed and compute a percentage
> > and apply that percentage to the steal time while the info is still on
> > the host.  Then pass the steal and consigned time to the guest.

Or maybe I missed why the suggestion above is immune to this problem?

> > 
> > Does that address your concerns?
> 
> I am not asking about passing percentage down the host - just pointing
> out a counter example to the correctness of the current algorithm.
> 
> I cannot see how you can report proper steal time value relative to
> hard cap without having that number calculated in the scheduler. IOW,
> "run_delay" must be split in two: you want to differentiate whether run
> delay was due to hard cap exhaustion or due to other reasons. Without
> that, steal time reporting is incorrect (as the example details). Now
> the question is, how to do that separation.
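
A rough sketch of what "run_delay split in two" could look like,
written as a toy model rather than scheduler code. The class and its
fields are hypothetical, and it assumes that at the time each wait is
accounted the scheduler can tell whether the wait was caused by hard
cap exhaustion. Whether that information is reliably available is
exactly the open question.

```python
from dataclasses import dataclass


@dataclass
class SplitRunDelay:
    """Hypothetical per-vcpu accounting: run_delay kept as two counters."""
    capped_ns: int = 0  # waited because the hard cap was exhausted
    other_ns: int = 0   # waited for any other reason (e.g. contention)

    def account(self, waited_ns, cap_exhausted):
        # Assumption: the caller knows the cause of this particular wait.
        if cap_exhausted:
            self.capped_ns += waited_ns
        else:
            self.other_ns += waited_ns

    @property
    def total_ns(self):
        # Equals what a single run_delay counter would have accumulated.
        return self.capped_ns + self.other_ns


delay = SplitRunDelay()
delay.account(30_000_000, cap_exhausted=False)  # 30 ms behind other tasks
delay.account(50_000_000, cap_exhausted=True)   # 50 ms held back by the cap
print(delay.other_ns, delay.capped_ns, delay.total_ns)
```

With the two counters kept apart, steal time relative to the hard cap
would be derived from other_ns only, and capped_ns could be exported
through a separate field.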
