Re: vread in kvm_clock

Zachary Amsden <zamsden@xxxxxxxxxx> · Mon, 20 Dec 2010 10:48:28 -1000

On 12/19/2010 08:16 PM, Avi Kivity wrote:
On 12/20/2010 03:15 AM, Zachary Amsden wrote:
On 12/19/2010 05:27 AM, Avi Kivity wrote:
On 12/17/2010 07:43 PM, Zachary Amsden wrote:
On 12/15/2010 10:16 AM, Julien Desfossez wrote:
Hi,

I'm currently working with the kvm clocksource and I'm wondering 
if we could implement the vread function for this clock source 
when we are running on a host with constant_tsc.
If I understand correctly the hv_clock structure is per_cpu 
because of the eventual frequency changes, but in the case of 
constant_tsc (and after validation that the TSC is synchronized 
across all the cores) I think we could have a working vread function.

In case of migration, could we have a fallback in case we detect 
we end up on a CPU without constant_tsc ?

Any advice/explanation would be greatly appreciated !

It's a bit more complex than that.  In addition to the problem you 
mention with migration, even if the TSC is synchronized, the 
kvmclock still is not, even with constant_tsc.  There is 
measurement error in between reading the TSC and computing the 
per_cpu hv_clock offset which varies between CPUs.

What about using rdtscp?

We could also disable kvmclock if constant_tsc and migration is not 
desired, or if constant_tsc and the new tsc multiplier on bulldozers 
is available on all machines in the migration cluster.

Even then, we need an atomic in the vread path.

Why?

RDTSCP will get you a per-CPU measured TSC+offset.  But even if there is 
global agreement on the TSC, there is still variation on the offset 
because kvmclock offsets are computed at different times in a per-cpu 
fashion.

So you can still go backwards in a global measurement with kvmclock, 
even with a synchronized TSC.
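
To make the effect concrete, here is a minimal userspace sketch (not kernel code). It models two CPUs that share a perfectly synchronized TSC but compute their clock offsets with different measurement error. The names compute_offset, guest_clock, skew_demo and the sample_error parameter are hypothetical, chosen only for this illustration, and the numbers are arbitrary.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: pick an offset so the clock reads guest_target at
 * the moment the host TSC is sampled.  sample_error stands for the cycles
 * that elapse between reading the reference time and reading the TSC; it
 * differs per CPU and per update.
 */
static int64_t compute_offset(uint64_t guest_target, uint64_t host_tsc_sample,
                              int64_t sample_error)
{
    return (int64_t)guest_target -
           (int64_t)(host_tsc_sample + (uint64_t)sample_error);
}

/* Clock value seen by a reader: synchronized TSC plus this CPU's offset. */
static uint64_t guest_clock(uint64_t host_tsc, int64_t offset)
{
    return host_tsc + (uint64_t)offset;
}

/* Returns 1 if a reader moving from CPU A to CPU B sees time go backwards. */
static int skew_demo(void)
{
    int64_t off_a = compute_offset(1000, 5000, 0);   /* no error on CPU A  */
    int64_t off_b = compute_offset(1000, 5000, 30);  /* 30 cycles on CPU B */

    assert(off_a != off_b);  /* same inputs, different per-CPU offsets */

    uint64_t t1 = guest_clock(6000, off_a);  /* read on CPU A            */
    uint64_t t2 = guest_clock(6001, off_b);  /* later TSC value on CPU B */

    return t2 < t1;
}
```

In this model the second read happens at a later TSC value, yet the clock it reports is smaller, because the two offsets disagree by more than the TSC advanced between the reads.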

The tsc multiplier does not look usable for virtualization, btw.

Why?

The documentation I've seen implies it is a single MSR that controls the 
TSC multiplier.  This implies two massively negative potential ways it 
works:

1) It can be set during the switching of MSRs when transitioning to the 
VMCB.  Then, it accelerates the TSC while running in SVM.  Observe that 
during the time spent not running in SVM for a particular VMCB, regular 
TSC time will pass.  This time must be accounted for by counting real 
time and adjusting the TSC offset each and every time the VMCB is used.

First, note that counting real time with the TSC is impossible if you 
switch CPUs and the hardware TSCs are not synchronized.  You must fall 
back to a secondary clock source, which greatly reduces precision of the 
computation.

Second, note that even if the hardware TSCs are synchronized, you are 
now re-computing the offset, in fact, MUST re-compute the TSC offset at 
each and every entry to the VMCB.  This means SMP VMs will have 
desynchronized TSCs, because every computation of the offset introduces 
a variable amount of error and all VCPUs will perform this computation 
at different times.

You end up back with a continuous random error applied to kvmclock / TSC 
for all CPUs and the required atomic check for TSC going backwards.
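
For reference, the kind of atomic backwards check meant here can be sketched in userspace C11 as below. This is only an illustration under assumptions: last_value, monotonic_clamp and clamp_selftest are hypothetical names, and a real vread path would have to do the equivalent with whatever primitives are available to it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical global: the latest timestamp handed out to any reader. */
static _Atomic uint64_t last_value;

/*
 * Clamp a per-CPU clock reading so the value returned is never smaller
 * than one already returned on any CPU.
 */
static uint64_t monotonic_clamp(uint64_t now)
{
    uint64_t last = atomic_load(&last_value);

    while (now > last) {
        /* On failure, 'last' is reloaded with the current global value. */
        if (atomic_compare_exchange_weak(&last_value, &last, now))
            return now;
    }
    /* Another reader already published a later (or equal) time. */
    return last;
}

/* Small self-check: a reading that lags behind is pulled up to the maximum. */
static void clamp_selftest(void)
{
    assert(monotonic_clamp(100) == 100);
    assert(monotonic_clamp(90) == 100);   /* would have gone backwards */
    assert(monotonic_clamp(150) == 150);
}
```

The cost is the point of the objection: every reader touches one shared cache line, and that is exactly the atomic in the read path that a fast vread would like to avoid.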

So it saves the exit cost, but at the cost of even greater complexity, 
which we don't need in the TSC code.  The performance issue in the 
migration scenario is not really that big of a deal.  Migrate to a 
slower machine; we catch up the TSC every chance we get to bring it back 
up to speed, and indicate that the atomic backwards protection is 
needed.  Migrate to a faster machine; we trap the TSC.  Yes, it's 
expensive.  It also is running on a faster machine, probably greater 
than 10% faster, which is about what we'll see for overhead.  So there 
is no net performance degradation.  Reboot the VM, maximum potential 
performance is restored.

2) It isn't clear that setting the MSR affects only the TSC in SVM 
mode.  If that were the case, why did they not add a new VMCB field 
extension?  Why is it a hardware MSR, which can be set outside of SVM?

The scary implication here is that the MSR may actually affect the 
hardware TSC rate, in which case, using it for SVM scaling destabilizes 
the host TSC.

That's completely non-usable.  I hope it doesn't work that way, I hope 
AMD was clever enough to use the auxiliary TSC for this, but that means 
there are still complex issues when you have multiple VMs running at 
different scaled TSC rates.

My take on it, either way, regardless of how it is implemented, is that 
basically, it adds a lot more complexity to the equation without solving 
the fundamental problem.  There's already a good enough software 
solution, we should use that.

Zach