Re: Nested VMX - L1 hangs on running L2

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



2011/7/27 Nadav Har'El <nyh@xxxxxxxxxxxxxxxxxxx>:
> On Wed, Jul 20, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs on running L2":
>> > > No, both patches are wrong.
>> >
>>
>> kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc) should always return the L1 TSC,
>> regardless of the setting of any MSR bitmap. The reason why is that it
>> is being called by the L0 hypervisor kernel, which handles only
>> interactions with the L1 MSRs.
>
> guest_read_tsc() (called by the above get_msr) currently does this:
>
>        static u64 guest_read_tsc(void)
>        {
>                u64 host_tsc, tsc_offset;
>
>                rdtscll(host_tsc);
>                tsc_offset = vmcs_read64(TSC_OFFSET);
>                return host_tsc + tsc_offset;
>        }

That's wrong.  You should NEVER believe the offset written into the
hardware VMCS to be the current, real L1 TSC offset, as that is not an
invariant.

Instead, you should ALWAYS return the host TSC + the L1 TSC offset.
Sometimes, this may be the hardware value.

> I guess you'd want this to change to something like:
>
>                tsc_offset = is_guest_mode(vcpu) ?
>                        vmx->nested.vmcs01_tsc_offset :
>                        vmcs_read64(TSC_OFFSET);
>
> But I still am not convinced that that would be right....

I believe this is correct.  But may it be cheaper to read from the
in-memory structure than the actual hardware VMCS?

> E.g., imagine the case where L1 uses TSC_OFFSETING and but doesn't
> trap TSC MSR read. The SDM says (if I understand it correctly) that this TSC
> MSR read will not exit (because L1 doesn't trap it) but *will* do the extra
> offsetting. In this case, the original code (using vmcs02's TSC_OFFSET which
> is the sum of that of vmcs01 and vmcs12), is correct, and the new code will
> be incorrect. Or am I misunderstanding the SDM?

In that case, you need to distinguish between reads of the TSC MSR by
the guest and reads by the host (as done internally to track drift and
compensation).  The code that needs to change isn't guest_read_tsc(),
that code must respect the invariant of only returning the L1 guest
TSC (in fact, that may be a good name change for the function).  What
needs to change is the actual code involved in the MSR read.  If it
determines that something other than the L1 guest is running, it needs
to ignore the hardware TSC offset and return the TSC as if read by the
L1 guest.

Unfortunately, the layering currently doesn't seem to allow for this,
and it looks like both vendor specific variants currently get this
wrong.

The call stack:

kvm_get_msr()
kvm_x86_ops->get_msr()
vendor_get_msr()
vendor_guest_read_tsc()

offers no insight as to the intention of the caller.  Is it trying to
get the guest TSC to return to the guest, or is it trying to get the
guest TSC to calibrate / measure and compensate for TSC effects?

So you are right, this is still wrong for the case in which L1 does
not trap TSC MSR reads.  Note however, the RDTSC instruction is still
virtualized properly, it is only the relatively rare actual TSC MSR
read via RDMSR which is mis-virtualized (this bug exists today in the
SVM implementation if I am reading it correctly - cc'd Joerg to notify
him of that).  That, combined with the relative importance of
supporting a guest which does not trap on these MSR reads suggest this
is a low priority design issue however (RDTSC still works even if the
MSR is trapped, correct?)

If you want to go the extra mile to support such guests, the only
fully correct approach then is to do one of the following:

1) Add a new vendor-specific API call, vmx_x86_ops->get_L1_TSC(), and
transform current uses of the code which does TSC compensation to use
this new API.  *Bonus* - no need to do double indirection through the
generic MSR code.

or, alternatively:

2) Do not trap MSR reads of the TSC if the current L1 guest is not
trapping MSR reads of the TSC.  This is not possible if you cannot
enforce specific read vs. write permission in hardware - it may be
possible however, if you can trap all MSR writes regardless of the
permission bitmap.

> Can you tell me in which case the original code would returen incorrect
> results to a guest (L1 or L2) doing anything MSR-related?

It never returns incorrect values to a guest.  It does however return
incorrect values to the L0 hypervisor, which is expecting to do
arithmetic based on the L1 TSC value, and this fails catastrophically
when it receives values for other nested guests.

> I'm assuming that some code in KVM also uses kvm_read_msr and assumes it
> gets the TSC value for L1, not for the guest currently running (L2 or L1).

Exactly.

> I don't understand why it needs to assume that... Why would it be wrong to
> return L2's TSC, and just remember that *changing* the L2 TSC really means
> changing the L1 TSC offset (vmcs01_tsc_offset), not vmcs12.tsc_offset which
> we can't touch?

L0 measures the L1 TSC at various points to be sure the L1 TSC never
regresses by going backwards, and also to restore the value if there
is a non-terminal hardware power interruption (hardware sleep which
resets hardware TSC to zero, but leaves the system image intact - i.e
S1 sleep).  If the measurement returns something other than the L1
TSC, this fails.

So the API there is today is broken, the way the code is today, it
needs to return the L1 TSC only - but the incorrect virtualization is
limited to quite an obscure case which doesn't appear to be a problem
in practice.  And as I pointed out, you can either fix the API, or if
you have cooperative hardware, ignore the issue and simply do not trap
on TSC MSR reads from L2 guests.

Zach
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux