Caution: this requires more care. Pretty sure this breaks userspace suspend at the cost of supporting a not-so-reasonable hardware feature. On Tue, Aug 2, 2011 at 5:55 AM, Nadav Har'El <nyh@xxxxxxxxxx> wrote: > When the TSC MSR is read by an L2 guest (when L1 allowed this MSR to be > read without exit), we need to return L2's notion of the TSC, not L1's. > > The current code incorrectly returned L1 TSC, because svm_get_msr() was also > used in x86.c where this was assumed, but now that these places call the new > svm_read_l1_tsc(), the MSR read can be fixed. > > Signed-off-by: Nadav Har'El <nyh@xxxxxxxxxx> > --- > arch/x86/kvm/svm.c | 4 +--- > 1 file changed, 1 insertion(+), 3 deletions(-) > > --- .before/arch/x86/kvm/svm.c 2011-08-02 15:51:02.000000000 +0300 > +++ .after/arch/x86/kvm/svm.c 2011-08-02 15:51:02.000000000 +0300 > @@ -2907,9 +2907,7 @@ static int svm_get_msr(struct kvm_vcpu * > > switch (ecx) { > case MSR_IA32_TSC: { > - struct vmcb *vmcb = get_host_vmcb(svm); > - > - *data = vmcb->control.tsc_offset + > + *data = svm->vmcb->control.tsc_offset + > svm_scale_tsc(vcpu, native_read_tsc()); > > break; > This is correct. Now you properly return the correct MSR value for the TSC to the guest in all cases. However, there is a BIG PROBLEM (yes, it is that bad). Sorry I did not think of it before. When the guest gets suspended and frozen by userspace, to be restarted later, what qemu is going to do is come along and read all of the MSRs as part of the saved state. One of those happens to be the TSC MSR. Since you can't guarantee suspend will take place when only L1 is running, you may have mixed L1/L2 TSC MSRs now being returned to userspace. Now, when you resume this guest, you will have mixed L1/L2 TSC values written into the MSRs. Those will almost certainly fail to be matched by the TSC offset matching code, and thus, with multiple VCPUs, you will end up with slightly unsynchronized TSCs, and with that, all the problems associated with unstable TSC come back to haunt you again. Basically, all bets for stability are off. Apart from recording the L1 and L2 TSC through separate MSRs, there isn't a good way to solve this. Walking through all the steps again, I'm pretty sure this is why I thought it would be better for L0 kvm_get_msr() to return the L1 values even if L2 was running. In the end, it may not be worth the hassle to support this mode of operation that to our current knowledge, is in fact, unused. I would much prefer to see a bad virtualization of a known insecure and unused guest mode than to have a well used feature such as L1 guest suspend / resume continue to cause clock de-synchronization. Zach -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html