Re: [PATCH 18/24] Exiting from L2 to L1

Avi Kivity <avi@xxxxxxxxxx> · Mon, 14 Jun 2010 15:04:08 +0300

On 06/13/2010 03:31 PM, Nadav Har'El wrote:
This patch implements nested_vmx_vmexit(), called when the nested L2 guest
exits and we want to run its L1 parent and let it handle this exit.

Note that this will not necessarily be called on every L2 exit. L0 may decide
to handle a particular exit on its own, without L1's involvement; In that
case, L0 will handle the exit, and resume running L2, without running L1 and
without calling nested_vmx_vmexit(). The logic for deciding whether to handle
a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
will appear in the next patch.

+/* prepare_vmcs_12 is called when the nested L2 guest exits and we want to
+ * prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12), and this
+ * function updates it to reflect the state of the registers during the exit,
+ * and to reflect some changes that happened while L2 was running (and perhaps
+ * made some exits which were handled directly by L0 without going back to L1).
+ */
+void prepare_vmcs_12(struct kvm_vcpu *vcpu)
+{
+	struct shadow_vmcs *vmcs12 = get_shadow_vmcs(vcpu);
+
+	vmcs12->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
+	vmcs12->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
+	vmcs12->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
+	vmcs12->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
+	vmcs12->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
+	vmcs12->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
+	vmcs12->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
+	vmcs12->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
+
+	vmcs12->tsc_offset = vmcs_read64(TSC_OFFSET);

TSC_OFFSET cannot have changed.

+	vmcs12->guest_physical_address = vmcs_read64(GUEST_PHYSICAL_ADDRESS);

Not available without EPT.

+	vmcs12->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);

Can this change?

+	vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);

Without msr bitmaps, cannot change.

+	if (vmcs_config.vmentry_ctrl&  VM_ENTRY_LOAD_IA32_PAT)
+		vmcs12->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);

Should check for VM_EXIT_SAVE_IA32_PAT, no?  Also unneeded without msr 
bitmaps and passthrough for this msr.

+	vmcs12->cr3_target_count = vmcs_read32(CR3_TARGET_COUNT);

R/O

+	vmcs12->vm_entry_intr_info_field =
+		vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);

Autocleared, no need to read.

+	vmcs12->vm_entry_exception_error_code =
+		vmcs_read32(VM_ENTRY_EXCEPTION_ERROR_CODE);
+	vmcs12->vm_entry_instruction_len =
+		vmcs_read32(VM_ENTRY_INSTRUCTION_LEN);

R/O

+	vmcs12->vm_instruction_error = vmcs_read32(VM_INSTRUCTION_ERROR);

We don't want to pass this to the guest?

+	vmcs12->vm_exit_reason  = vmcs_read32(VM_EXIT_REASON);
+	vmcs12->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+	vmcs12->vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
+	vmcs12->idt_vectoring_info_field =
+		vmcs_read32(IDT_VECTORING_INFO_FIELD);
+	vmcs12->idt_vectoring_error_code =
+		vmcs_read32(IDT_VECTORING_ERROR_CODE);
+	vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+	vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);

For the above, if the host handles the exit, we must not clobber guest 
fields.  A subsequent guest vmread will see the changed values even 
though from its point of view a vmexit has not occurred.

But no, that can't happen, since a vmread needs to have a vmexit first 
to happen.  Still, best to delay this.+
+	/* If any of the CRO_GUEST_HOST_MASK bits are off, the L2 guest may
+	 * have changed some cr0 bits without us ever saving them in the shadow
+	 * vmcs. So we need to save these changes now.
+	 * In the current code, the only GHM bit which can be off is TS (it
+	 * will be off when fpu_active and L1 also set it to off).
+	 */
+	vmcs12->guest_cr0 = vmcs_readl(GUEST_CR0);
+
+	/* But this may not be the guest_cr0 that the L1 guest hypervisor
+	 * actually thought it was giving its L2 guest. It is possible that
+	 * L1 wished to allow its guest to set a cr0 bit directly, but we (L0)
+	 * captured this attempt and instead set just the read shadow. If this
+	 * is the case, we need copy these read-shadow bits back to guest_cr0,
+	 * where L1 believes they already are. Note that we must read the
+	 * actual CR0_READ_SHADOW (which is what L0 may have changed), not
+	 * vmcs12->cr0_read_shadow (which L1 defined, and we don't
+	 * change without being told by L1). Currently, the only bit where
+	 * this can happen is TS.
+	 */
+	if (!(vcpu->arch.cr0_guest_owned_bits&  X86_CR0_TS)
+			&&  !(vmcs12->cr0_guest_host_mask&  X86_CR0_TS))
+		vmcs12->guest_cr0 =
+			(vmcs12->guest_cr0&  ~X86_CR0_TS) |
+			(vmcs_readl(CR0_READ_SHADOW)&  X86_CR0_TS);
+
+	vmcs12->guest_cr4 = vmcs_readl(GUEST_CR4);

Can't we have the same issue with cr4?

Better to have some helpers to do the common magic, and not encode the 
special knowledge about TS into it (make it generic).

+
+int switch_back_vmcs(struct kvm_vcpu *vcpu)
+{
+	struct shadow_vmcs *src = to_vmx(vcpu)->nested.l1_shadow_vmcs;
+
+	if (enable_vpid&&  src->virtual_processor_id != 0)
+		vmcs_write16(VIRTUAL_PROCESSOR_ID, src->virtual_processor_id);

IIUC vpids are not exposed to the guest yet?  So the VPID should not 
change between guest and nested guest.

+
+	vmcs_write64(IO_BITMAP_A, src->io_bitmap_a);
+	vmcs_write64(IO_BITMAP_B, src->io_bitmap_b);

Why change the I/O bitmap?

+
+	if (cpu_has_vmx_msr_bitmap())
+		vmcs_write64(MSR_BITMAP, src->msr_bitmap);

Or the msr bitmap?  After all, we're switching the entire vmcs?

+
+	vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, src->virtual_apic_page_addr);
+
+	if (vm_need_virtualize_apic_accesses(vcpu->kvm))
+		vmcs_write64(APIC_ACCESS_ADDR,
+			     src->apic_access_addr);
+
+	if (enable_ept) {
+		vmcs_write64(EPT_POINTER, src->ept_pointer);
+		vmcs_write64(GUEST_PDPTR0, src->guest_pdptr0);
+		vmcs_write64(GUEST_PDPTR1, src->guest_pdptr1);
+		vmcs_write64(GUEST_PDPTR2, src->guest_pdptr2);
+		vmcs_write64(GUEST_PDPTR3, src->guest_pdptr3);
+	}

A kvm_set_cr3(src->host_cr3) should do all that and more, no?

+
+	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, src->pin_based_vm_exec_control);
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, src->cpu_based_vm_exec_control);
+	vmcs_write32(EXCEPTION_BITMAP, src->exception_bitmap);
+	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK,
+		     src->page_fault_error_code_mask);
+	vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH,
+		     src->page_fault_error_code_match);
+	vmcs_write32(VM_EXIT_CONTROLS, src->vm_exit_controls);
+	vmcs_write32(VM_ENTRY_CONTROLS, src->vm_entry_controls);

Why write all these?  What could have changed them?

+
+	if (cpu_has_secondary_exec_ctrls())
+		vmcs_write32(SECONDARY_VM_EXEC_CONTROL,
+			     src->secondary_vm_exec_control);
+
+	load_vmcs_common(src);
+
+	load_vmcs_host_state(to_vmx(vcpu)->nested.l1_shadow_vmcs);
+
+	return 0;
+}
+
+static int nested_vmx_vmexit(struct kvm_vcpu *vcpu,
+			     bool is_interrupt)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int efer_offset;
+
+	if (!vmx->nested.nested_mode) {
+		printk(KERN_INFO "WARNING: %s called but not in nested mode\n",
+		       __func__);
+		return 0;
+	}
+
+	sync_cached_regs_to_vmcs(vcpu);
+
+	if (!nested_map_current(vcpu)) {
+		printk(KERN_INFO "Error mapping shadow vmcs\n");
+		set_rflags_to_vmx_fail_valid(vcpu);
+		return 1;
+	}
+
+	prepare_vmcs_12(vcpu);
+	if (is_interrupt)
+		get_shadow_vmcs(vcpu)->vm_exit_reason =
+			EXIT_REASON_EXTERNAL_INTERRUPT;
+
+	vmx->nested.current_l2_page->launched = vmx->launched;
+	vmx->nested.current_l2_page->cpu = vcpu->cpu;
+
+	nested_unmap_current(vcpu);
+
+	vmx->vmcs = vmx->nested.l1_vmcs;
+	vcpu->cpu = vmx->nested.l1_state.cpu;
+	vmx->launched = vmx->nested.l1_state.launched;
+
+	vmx_vcpu_load(vcpu, get_cpu());
+	put_cpu();
+
+	vcpu->arch.efer = vmx->nested.l1_state.efer;
+	if ((vcpu->arch.efer&  EFER_LMA)&&
+	    !(vcpu->arch.efer&  EFER_SCE))
+		vcpu->arch.efer |= EFER_SCE;
+
+	efer_offset = __find_msr_index(vmx, MSR_EFER);
+	if (update_transition_efer(vmx, efer_offset))
+		wrmsrl(MSR_EFER, vmx->guest_msrs[efer_offset].data);
+
+	/* We're running a regular L1 guest again, so we do the regular KVM
+	 * thing: run vmx_set_cr0 with the cr0 bits the guest thinks it has
+	 * (this can be figured out by combining its old guest_cr0 and
+	 * cr0_read_shadow, using the cr0_guest_host_mask). vmx_set_cr0 might
+	 * use slightly different bits on the new guest_cr0 it sets, e.g.,
+	 * add TS when !fpu_active.
+	 */
+	vmx_set_cr0(vcpu,
+		(vmx->nested.l1_shadow_vmcs->cr0_guest_host_mask&
+		vmx->nested.l1_shadow_vmcs->cr0_read_shadow) |
+		(~vmx->nested.l1_shadow_vmcs->cr0_guest_host_mask&
+		vmx->nested.l1_shadow_vmcs->guest_cr0));

Helper wanted.

+
+	vmx_set_cr4(vcpu, vmx->nested.l1_state.cr4);
+

Again, the kvm_set_crx() versions have more meat.

+	if (enable_ept) {
+		vcpu->arch.cr3 = vmx->nested.l1_shadow_vmcs->guest_cr3;
+		vmcs_write32(GUEST_CR3, vmx->nested.l1_shadow_vmcs->guest_cr3);
+	} else {
+		kvm_set_cr3(vcpu, vmx->nested.l1_state.cr3);
+	}

kvm_set_cr3() will load the PDPTRs in the EPT case (correctly in case 
the nested guest was able to corrupted the guest's PDPT).

+
+	if (!nested_map_current(vcpu)) {
+		printk(KERN_INFO "Error mapping shadow vmcs\n");
+		set_rflags_to_vmx_fail_valid(vcpu);
+		return 1;
+	}
+
+	switch_back_vmcs(vcpu);
+
+	nested_unmap_current(vcpu);
+
+	kvm_register_write(vcpu, VCPU_REGS_RSP,
+			   vmx->nested.l1_shadow_vmcs->guest_rsp);
+	kvm_register_write(vcpu, VCPU_REGS_RIP,
+			   vmx->nested.l1_shadow_vmcs->guest_rip);
+
+	vmx->nested.nested_mode = 0;
+
+	/* If we did fpu_activate()/fpu_deactive() during l2's run, we need
+	 * to apply the same changes also when running l1. We don't need to
+	 * change cr0 here - we already did this above - just the
+	 * cr0_guest_host_mask, and exception bitmap.
+	 */
+	vmcs_write32(EXCEPTION_BITMAP,
+		(vmx->nested.l1_shadow_vmcs->exception_bitmap&
+			~(1u<<NM_VECTOR)) |
+			(vcpu->fpu_active ? 0 : (1u<<NM_VECTOR)));
+	vcpu->arch.cr0_guest_owned_bits = (vcpu->fpu_active ? X86_CR0_TS : 0);
+	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
+
+	kvm_mmu_reset_context(vcpu);
+	kvm_mmu_load(vcpu);

kvm_mmu_load() unneeded, usually.

+
+	if (unlikely(vmx->fail)) {
+		vmx->fail = 0;
+		set_rflags_to_vmx_fail_valid(vcpu);
+	} else
+		clear_rflags_cf_zf(vcpu);
+
+	return 0;
+}
+
  static struct kvm_x86_ops vmx_x86_ops = {
  	.cpu_has_kvm_support = cpu_has_kvm_support,
  	.disabled_by_bios = vmx_disabled_by_bios,

I'm probably missing something about the read/write of various vmcs fields.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html