On Fri, Nov 25, 2016 at 3:15 PM, Radim Krčmář <rkrcmar@xxxxxxxxxx> wrote:
> 2016-11-25 09:44+0100, Ladi Prosek:
>> On Thu, Nov 24, 2016 at 7:25 PM, Radim Krčmář <rkrcmar@xxxxxxxxxx> wrote:
>>> 2016-11-24 14:29+0100, Ladi Prosek:
>>>> KVM does not correctly handle L1 hypervisors that emulate L2 real mode
>>>> with PAE and EPT, such as Hyper-V. In this mode, the L1 hypervisor
>>>> populates guest PDPTE VMCS fields and leaves guest CR3 uninitialized
>>>> because it is not used (see 26.3.2.4 Loading Page-Directory-Pointer-Table
>>>> Entries). KVM always dereferences CR3 and tries to load PDPTEs if PAE is
>>>> on. This leads to two related issues:
>>>>
>>>> 1) On the first nested vmentry, the guest PDPTEs, as populated by L1, are
>>>> overwritten in ept_load_pdptrs because the registers are believed to have
>>>> been loaded in load_pdptrs as part of kvm_set_cr3. This is incorrect. L2
>>>> is running with PAE enabled but PDPTRs have been set up by L1.
>>>>
>>>> 2) When L2 is about to enable paging and loads its CR3, we, again, attempt
>>>> to load PDPTEs in load_pdptrs called from kvm_set_cr3. There are no
>>>> guarantees that this will succeed (it's just a CR3 load, paging is not
>>>> enabled yet) and if it doesn't, kvm_set_cr3 returns early without
>>>> persisting the CR3 which is then lost and L2 crashes right after it
>>>> enables paging.
>>>>
>>>> This patch replaces the kvm_set_cr3 call with a simple register write if
>>>> PAE and EPT are both on. CR3 is not to be interpreted in this case.
>>>>
>>>> Signed-off-by: Ladi Prosek <lprosek@xxxxxxxxxx>
>>>> ---
>>>>
>>>> tl;dr This makes Hyper-V (Windows Server 2016) work on KVM.
>>>
>>> Great work!
>>>
>>>> @@ -10113,8 +10115,17 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>>>>  	vmx_set_cr4(vcpu, vmcs12->guest_cr4);
>>>>  	vmcs_writel(CR4_READ_SHADOW, nested_read_cr4(vmcs12));
>>>>
>>>> -	/* shadow page tables on either EPT or shadow page tables */
>>>> -	kvm_set_cr3(vcpu, vmcs12->guest_cr3);
>>>> +	/*
>>>> +	 * Shadow page tables on either EPT or shadow page tables.
>>>> +	 * If PAE and EPT are both on, CR3 is not used by the CPU and must not
>>>> +	 * be dereferenced.
>>>> +	 */
>>>> +	if (is_pae(vcpu) && is_paging(vcpu) && nested_ept_enabled) {
>>>
>>> The spec also has a "&& !is_long_mode(vcpu)" condition for PAE paging.
>>> I'm pretty sure that is_pae() and is_paging() is set in long mode ...
>>> the stuff in kvm_set_cr3() is not needed even then?
>>
>> Good catch, yes the spec also has !is_long_mode and it should be here
>> for completeness. It won't make much of a difference though, I believe
>> that kvm_set_cr3 is to a large extent unnecessary here.
>
> I agree. The patch is in kvm/queue to get some testing -- I'll replace
> it if you write an update.

Yes, I am planning to send a new version.

>> What kvm_set_cr3 does:
>>
>> * conditional kvm_mmu_sync_roots + kvm_make_request(KVM_REQ_TLB_FLUSH)
>> ** kvm_mmu_sync_roots will run anyway: kvm_mmu_load <- kvm_mmu_reload
>>    <- vcpu_enter_guest
>> ** tlb flush will be done anyway: vmx_flush_tlb <- vmx_set_cr3 <-
>>    kvm_mmu_load <- kvm_mmu_reload <- vcpu_enter_guest
>>
>> * in long mode, it fails if (cr3 & CR3_L_MODE_RESERVED_BITS)
>> ** nobody checks the return value
>> ** Intel manual says "Reserved bits in CR0 and CR3 remain clear after
>>    any load of those registers; attempts to set them have no impact."
>>    Should we just clear the bits and carry on then? This is in conflict
>>    with "#GP(0) .. If an attempt is made to write a 1 to any reserved
>>    bit in CR3." Hmm.
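
(Before getting to your answer below, one note on the condition we
discussed above: for v2 I'm inclined to spell the SDM's PAE-paging
predicate out in one place so the !is_long_mode() part can't get lost
again. Just a sketch to show the shape -- the helper name is made up,
nothing here is from the actual patch:)

	/*
	 * Sketch only: "PAE paging" in the SDM's sense is paging with
	 * CR0.PG=1, CR4.PAE=1 and long mode inactive.  Only in that mode
	 * does the CPU consume the four PDPTE fields instead of walking
	 * CR3, so this is the case where GUEST_CR3 must not be
	 * dereferenced when nested EPT is also enabled.
	 */
	static bool vcpu_uses_pae_paging(struct kvm_vcpu *vcpu)
	{
		return is_paging(vcpu) && is_pae(vcpu) && !is_long_mode(vcpu);
	}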
>
> The spec is quite clear on this. 26.3.1.1 Checks on Guest Control
> Registers, Debug Registers, and MSRs:
>
>   The following checks are performed on processors that support Intel 64
>   architecture: The CR3 field must be such that bits 63:52 and bits in
>   the range 51:32 beyond the processor’s physical-address width are 0.
>
> To verify, I tried these two options on top of vmx_vcpu_run
>
>   vmcs_writel(GUEST_CR3, vmcs_readl(GUEST_CR3) | 1UL << boot_cpu_data.x86_phys_bits);
>   vmcs_writel(GUEST_CR3, vmcs_readl(GUEST_CR3) | CR3_PCID_INVD);
>
> and both failed VM entry. We should fail the nested VM entry as well
> and use cpuid_maxphyaddr() to determine when.

Thanks, I hadn't realized that the rules for VM entry are different from
regular CR3 loads. One more reason for not using kvm_set_cr3 here.

> (And I have a bad feeling that guest's physical address width is not
> being limited by hardware's ...)

Can you elaborate? In which MMU modes would it be causing problems?

>> * if (is_pae(vcpu) && is_paging(vcpu)), load_pdptrs is executed
>> ** seems legit!
>>
>> * kvm_mmu_new_cr3
>> ** will run anyway, handled in kvm_mmu_reset_context called right
>>    after kvm_set_cr3
>>
>> I am tempted to make more invasive changes. What do you think?
>
> Go ahead!
>
> Thanks.
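
To make the plan a bit more concrete, the direction I have in mind for
the new version is roughly the following. This is an untested sketch
only, not the actual v2: the function name is made up and where exactly
it gets called from prepare_vmcs02 (and how the failure is reported to
L1) is still to be decided.

	static int nested_vmx_load_guest_cr3(struct kvm_vcpu *vcpu,
					     struct vmcs12 *vmcs12)
	{
		unsigned long cr3 = vmcs12->guest_cr3;

		/*
		 * VM-entry checks (26.3.1.1): CR3 bits 63:52 and bits at or
		 * above the physical-address width must be zero.  Unlike a
		 * MOV to CR3, this has to fail the nested VM entry, per your
		 * experiment above.
		 */
		if (cr3 >> cpuid_maxphyaddr(vcpu))
			return 1;	/* caller fails the nested VM entry */

		if (is_paging(vcpu) && is_pae(vcpu) && !is_long_mode(vcpu) &&
		    nested_cpu_has_ept(vmcs12)) {
			/*
			 * PAE paging with EPT: the CPU uses the PDPTE fields
			 * provided by L1, not CR3, so install the value
			 * without interpreting or dereferencing it.
			 */
			vcpu->arch.cr3 = cr3;
			__set_bit(VCPU_EXREG_CR3,
				  (ulong *)&vcpu->arch.regs_avail);
		} else {
			/* Shadow paging, or EPT without PAE paging: as before. */
			kvm_set_cr3(vcpu, cr3);
		}

		return 0;
	}

The else branch keeps today's behaviour; the interesting part is only
skipping the PDPTE load and the CR3 sanity checks that kvm_set_cr3 does
when the CPU is not going to look at CR3 at all.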