Re: Nested VMX - L1 hangs on running L2

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Jul 20, 2011 at 12:52 PM, Nadav Har'El <NYH@xxxxxxxxxx> wrote:
>
> > No, both patches are wrong.
>
> Guys, thanks for looking into this bug. I'm afraid I'm still at a loss at
> why a TSC bug would even cause a guest lockup :(
>
> When Avi Kivity saw my nested TSC handling code he remarked "this is
> probably wrong". When I asked him where it was wrong, he basically said
> that he didn't know where, but TSC handling code is always wrong ;-)
> And it turns out he was right.
>
> > The correct fix is to make kvm_get_msr() return the L1 guest TSC at all
> times.
> >  We are serving the L1 guest in this hypervisor, not the L2 guest, and so
>
> > should never read its TSC for any reason.
> ...
> > allows the L2 guest to overwrite the L1 guest TSC, which at first seems
> wrong,
> > but is in fact the correct virtualization of a security hole in the L1
> guest.
>
> I think I'm beginning to see the error in my ways...
>
> When L1 lets L2 (using the MSR bitmap) direct read/write access to the TSC,
> it doesn't want L0 to be "clever" and give L2 its own separate TSC (like
> I do now), but rather gives it full control over L1's TSC - so reading or
> writing it should actually return L1's TSC, and the TSC_OFFSET in vmcs12
> is to be ignored.


>
> So basically, if I understand correctly, what I need to change is
> in prepare_vmcs02(), if the MSR_IA32_TSC is on the MSR bitmap (read?
> write?), instead of doing
>        vmcs_write64(TSC_OFFSET,
>                vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> I just need to do
>        vmcs_write64(TSC_OFFSET,
>                vmx->nested.vmcs01_tsc_offset);
> thereby giving L2 exactly the same TSC that L1 had.
> Brandan, if I remember correctly you once tried this sort of fix and
> it actually worked?
>
> Then, guest_read_tsc() will return (without need to change this code)
> the correct L1 TSC.
>

Note quite.

kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc) should always return the L1 TSC,
regardless of the setting of any MSR bitmap. The reason why is that it
is being called by the L0 hypervisor kernel, which handles only
interactions with the L1 MSRs.

If L1 wants to configure L2 to have its own private TSC (99.99% of use
cases), it should mask off access to the TSC MSR in the permission
bitmap. Then
when L2 reads or writes the MSR (not the TSC, which is protected by
the RDTSC instruction, and a different trap), an exit will be
generated to L0, which will read the permission bitmaps, determine the
access is failed, and forward the exit to L1, which will deal with
reading or writing the L2 MSR. L0 NEVER deals directly with reading or
writing the L2 MSR in this scenario, how that is handled is supposed
to be determined by L1.

If L1 wants to configure L2 to be able to read and write the L1 TSC
instead, it should enable access to the TSC MSR in the MSR permission
bitmap.  Then when L2 reads or write the MSR, an exit will be
generated to L0, which will read the permission bitmaps, determine the
access is a success, and directly read or write the L1 TSC.  This
requires changing both the offset of the running L2 (in the active L0
VMCS) and the saved offset of the inactive L1.

However, when computing the L2 tsc offset field to use for a running
guest, BOTH the L1 and L2 offsets should be added.  This is exactly
what nested hardware would do.  This implies that to be correct in the
second scenario, you need to forget the change you made to the offset
of the running L2 (in the active L0 VMCS) when the L2 guest stops
running.  You want it to be live on hardware, but not stored in any
saved TSC_OFFSET field for the L2 guest - it seems illogical, but it
is not.  The reason for that is that the L2 guest was updating ONLY
the L1 TSC, not the L2 TSC_OFFSET (which is still controlled by L1).

Note the following invariants: For N>1, L0 will never under any
circumstance set or adjust the L{N} TSC virtual VMCS field.  That is
only for the L{N-1} guest to do, and any attempt to do so by any other
than L{N-1} is not a correct virtualization.  L0 will only set the
hardware VMCS field to the offset required for virtualizing each level
L1-L{N}.

Also, the invariant for nested hardware:  the observed TSC of an L{N}
guest is equal to the hardware TSC, plus the TSC offset field for all
L1-L{N}.  If the hardware does not support nested virtualization, this
invariant must be guaranteed by the L0 hypervisor, which is simulating
nested hardware by adding up all of the nested offset fields.

Neither of these facts involves checking the status of permission bits
in the MSR map.  That is only done for determining success or failure
of an MSR read or write, which may be forwarded to some L{N} as a
virtual exit or #GP fault.

Yes, it is tricky, but once you start thinking of it in simple
hardware terms, you will realize there is no need to be clever, just
precise.  If it seems counterintuitive to think about the second case,
consider the scenario when L2 and L1 are cooperative, and L1 sets the
L2 TSC_OFFSET to zero :)  Then L2 simply manages the TSC for L1.  I
find that makes reasoning about the correct place at which to store
writes and adjustments much more intuitive.

Zach
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux