On Fri, Nov 25, 2016 at 3:15 PM, Radim Krčmář <rkrcmar@xxxxxxxxxx> wrote:
> 2016-11-25 09:44+0100, Ladi Prosek:
>> On Thu, Nov 24, 2016 at 7:25 PM, Radim Krčmář <rkrcmar@xxxxxxxxxx> wrote:
>>> 2016-11-24 14:29+0100, Ladi Prosek:
>>>> KVM does not correctly handle L1 hypervisors that emulate L2 real mode
>>>> with PAE and EPT, such as Hyper-V. In this mode, the L1 hypervisor
>>>> populates guest PDPTE VMCS fields and leaves guest CR3 uninitialized
>>>> because it is not used (see 26.3.2.4 Loading Page-Directory-Pointer-Table
>>>> Entries). KVM always dereferences CR3 and tries to load PDPTEs if PAE is
>>>> on. This leads to two related issues:
>>>>
>>>> 1) On the first nested vmentry, the guest PDPTEs, as populated by L1, are
>>>> overwritten in ept_load_pdptrs because the registers are believed to have
>>>> been loaded in load_pdptrs as part of kvm_set_cr3. This is incorrect. L2
>>>> is running with PAE enabled but PDPTRs have been set up by L1.
>>>>
>>>> 2) When L2 is about to enable paging and loads its CR3, we, again, attempt
>>>> to load PDPTEs in load_pdptrs called from kvm_set_cr3. There are no
>>>> guarantees that this will succeed (it's just a CR3 load, paging is not
>>>> enabled yet) and if it doesn't, kvm_set_cr3 returns early without
>>>> persisting the CR3 which is then lost and L2 crashes right after it
>>>> enables paging.
>>>>
>>>> This patch replaces the kvm_set_cr3 call with a simple register write if
>>>> PAE and EPT are both on. CR3 is not to be interpreted in this case.
>>>>
>>>> Signed-off-by: Ladi Prosek <lprosek@xxxxxxxxxx>
>>>> ---
>>>>
>>>> tl;dr This makes Hyper-V (Windows Server 2016) work on KVM.
>>>
>>> Great work!
>>>
>>>> @@ -10113,8 +10115,17 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>>>>  	vmx_set_cr4(vcpu, vmcs12->guest_cr4);
>>>>  	vmcs_writel(CR4_READ_SHADOW, nested_read_cr4(vmcs12));
>>>>
>>>> -	/* shadow page tables on either EPT or shadow page tables */
>>>> -	kvm_set_cr3(vcpu, vmcs12->guest_cr3);
>>>> +	/*
>>>> +	 * Shadow page tables on either EPT or shadow page tables.
>>>> +	 * If PAE and EPT are both on, CR3 is not used by the CPU and must not
>>>> +	 * be dereferenced.
>>>> +	 */
>>>> +	if (is_pae(vcpu) && is_paging(vcpu) && nested_ept_enabled) {
>>>
>>> The spec also has a "&& !is_long_mode(vcpu)" condition for PAE paging.
>>> I'm pretty sure that is_pae() and is_paging() is set in long mode ...
>>> the stuff in kvm_set_cr3() is not needed even then?
>>
>> Good catch, yes the spec also has !is_long_mode and it should be here
>> for completeness. It won't make much of a difference though, I believe
>> that kvm_set_cr3 is to a large extent unnecessary here.
>
> I agree. The patch is in kvm/queue to get some testing -- I'll replace
> it if you write an update.

Yes, I am planning to send a new version.

>> What kvm_set_cr3 does:
>>
>> * conditional kvm_mmu_sync_roots + kvm_make_request(KVM_REQ_TLB_FLUSH)
>> ** kvm_mmu_sync_roots will run anyway: kvm_mmu_load <- kvm_mmu_reload
>>    <- vcpu_enter_guest
>> ** tlb flush will be done anyway: vmx_flush_tlb <- vmx_set_cr3 <-
>>    kvm_mmu_load <- kvm_mmu_reload <- vcpu_enter_guest
>>
>> * in long mode, it fails if (cr3 & CR3_L_MODE_RESERVED_BITS)
>> ** nobody checks the return value
>> ** Intel manual says "Reserved bits in CR0 and CR3 remain clear after
>>    any load of those registers; attempts to set them have no impact."
>>    Should we just clear the bits and carry on then? This is in conflict
>>    with "#GP(0) .. If an attempt is made to write a 1 to any reserved
>>    bit in CR3." Hmm.
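
(Before getting to your answer below, one note on the condition we
discussed above: for v2 I'm inclined to spell the SDM's PAE-paging
predicate out in one place so the !is_long_mode() part can't get lost
again. Just a sketch to show the shape -- the helper name is made up,
nothing here is from the actual patch:)

	/*
	 * Sketch only: "PAE paging" in the SDM's sense is paging with
	 * CR0.PG=1, CR4.PAE=1 and long mode inactive.  Only in that mode
	 * does the CPU consume the four PDPTE fields instead of walking
	 * CR3, so this is the case where GUEST_CR3 must not be
	 * dereferenced when nested EPT is also enabled.
	 */
	static bool vcpu_uses_pae_paging(struct kvm_vcpu *vcpu)
	{
		return is_paging(vcpu) && is_pae(vcpu) && !is_long_mode(vcpu);
	}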
>
> The spec is quite clear on this. 26.3.1.1 Checks on Guest Control
> Registers, Debug Registers, and MSRs:
>
>   The following checks are performed on processors that support Intel 64
>   architecture: The CR3 field must be such that bits 63:52 and bits in
>   the range 51:32 beyond the processor’s physical-address width are 0.
>
> To verify, I tried these two options on top of vmx_vcpu_run
>
>   vmcs_writel(GUEST_CR3, vmcs_readl(GUEST_CR3) | 1UL << boot_cpu_data.x86_phys_bits);
>   vmcs_writel(GUEST_CR3, vmcs_readl(GUEST_CR3) | CR3_PCID_INVD);
>
> and both failed VM entry. We should fail the nested VM entry as well
> and use cpuid_maxphyaddr() to determine when.

Thanks, I hadn't realized that the rules for VM entry are different from
regular CR3 loads. One more reason for not using kvm_set_cr3 here.

> (And I have a bad feeling that guest's physical address width is not
> being limited by hardware's ...)

Can you elaborate? In which MMU modes would it be causing problems?

>> * if (is_pae(vcpu) && is_paging(vcpu)), load_pdptrs is executed
>> ** seems legit!
>>
>> * kvm_mmu_new_cr3
>> ** will run anyway, handled in kvm_mmu_reset_context called right
>>    after kvm_set_cr3
>>
>> I am tempted to make more invasive changes. What do you think?
>
> Go ahead!
>
> Thanks.
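
To make the plan a bit more concrete, the direction I have in mind for
the new version is roughly the following. This is an untested sketch
only, not the actual v2: the function name is made up and where exactly
it gets called from prepare_vmcs02 (and how the failure is reported to
L1) is still to be decided.

	static int nested_vmx_load_guest_cr3(struct kvm_vcpu *vcpu,
					     struct vmcs12 *vmcs12)
	{
		unsigned long cr3 = vmcs12->guest_cr3;

		/*
		 * VM-entry checks (26.3.1.1): CR3 bits 63:52 and bits at or
		 * above the physical-address width must be zero.  Unlike a
		 * MOV to CR3, this has to fail the nested VM entry, per your
		 * experiment above.
		 */
		if (cr3 >> cpuid_maxphyaddr(vcpu))
			return 1;	/* caller fails the nested VM entry */

		if (is_paging(vcpu) && is_pae(vcpu) && !is_long_mode(vcpu) &&
		    nested_cpu_has_ept(vmcs12)) {
			/*
			 * PAE paging with EPT: the CPU uses the PDPTE fields
			 * provided by L1, not CR3, so install the value
			 * without interpreting or dereferencing it.
			 */
			vcpu->arch.cr3 = cr3;
			__set_bit(VCPU_EXREG_CR3,
				  (ulong *)&vcpu->arch.regs_avail);
		} else {
			/* Shadow paging, or EPT without PAE paging: as before. */
			kvm_set_cr3(vcpu, cr3);
		}

		return 0;
	}

The else branch keeps today's behaviour; the interesting part is only
skipping the PDPTE load and the CR3 sanity checks that kvm_set_cr3 does
when the CPU is not going to look at CR3 at all.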