On Thu, Sep 5, 2013 at 7:05 PM, Zhang, Yang Z <yang.z.zhang@xxxxxxxxx> wrote: > Arthur Chunqi Li wrote on 2013-09-05: >> > Arthur Chunqi Li wrote on 2013-09-05: >> >> On Thu, Sep 5, 2013 at 3:45 PM, Zhang, Yang Z >> >> <yang.z.zhang@xxxxxxxxx> >> >> wrote: >> >> > Arthur Chunqi Li wrote on 2013-09-04: >> >> >> This patch contains the following two changes: >> >> >> 1. Fix the bug in nested preemption timer support. If vmexit >> >> >> L2->L0 with some reasons not emulated by L1, preemption timer >> >> >> value should be save in such exits. >> >> >> 2. Add support of "Save VMX-preemption timer value" VM-Exit >> >> >> controls to nVMX. >> >> >> >> >> >> With this patch, nested VMX preemption timer features are fully >> supported. >> >> >> >> >> >> Signed-off-by: Arthur Chunqi Li <yzt356@xxxxxxxxx> >> >> >> --- >> >> >> This series depends on queue. >> >> >> >> >> >> arch/x86/include/uapi/asm/msr-index.h | 1 + >> >> >> arch/x86/kvm/vmx.c | 51 >> >> >> ++++++++++++++++++++++++++++++--- >> >> >> 2 files changed, 48 insertions(+), 4 deletions(-) >> >> >> >> >> >> diff --git a/arch/x86/include/uapi/asm/msr-index.h >> >> >> b/arch/x86/include/uapi/asm/msr-index.h >> >> >> index bb04650..b93e09a 100644 >> >> >> --- a/arch/x86/include/uapi/asm/msr-index.h >> >> >> +++ b/arch/x86/include/uapi/asm/msr-index.h >> >> >> @@ -536,6 +536,7 @@ >> >> >> >> >> >> /* MSR_IA32_VMX_MISC bits */ >> >> >> #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL >> << >> >> 29) >> >> >> +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE 0x1F >> >> >> /* AMD-V MSRs */ >> >> >> >> >> >> #define MSR_VM_CR 0xc0010114 >> >> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index >> >> >> 1f1da43..870caa8 >> >> >> 100644 >> >> >> --- a/arch/x86/kvm/vmx.c >> >> >> +++ b/arch/x86/kvm/vmx.c >> >> >> @@ -2204,7 +2204,14 @@ static __init void >> >> >> nested_vmx_setup_ctls_msrs(void) #ifdef CONFIG_X86_64 >> >> >> VM_EXIT_HOST_ADDR_SPACE_SIZE | #endif >> >> >> - VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT; >> >> >> + VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT >> | >> >> >> + VM_EXIT_SAVE_VMX_PREEMPTION_TIMER; >> >> >> + if (!(nested_vmx_pinbased_ctls_high & >> >> >> PIN_BASED_VMX_PREEMPTION_TIMER)) >> >> >> + nested_vmx_exit_ctls_high &= >> >> >> + >> (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER); >> >> >> + if (!(nested_vmx_exit_ctls_high & >> >> >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) >> >> >> + nested_vmx_pinbased_ctls_high &= >> >> >> + (~PIN_BASED_VMX_PREEMPTION_TIMER); >> >> > The following logic is more clearly: >> >> > if(nested_vmx_pinbased_ctls_high & >> >> PIN_BASED_VMX_PREEMPTION_TIMER) >> >> > nested_vmx_exit_ctls_high |= >> >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER >> >> Here I have such consideration: this logic is wrong if CPU support >> >> PIN_BASED_VMX_PREEMPTION_TIMER but doesn't support >> >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER, though I don't know if this >> does >> >> occurs. So the codes above reads the MSR and reserves the features it >> >> supports, and here I just check if these two features are supported >> >> simultaneously. >> >> >> > No. Only VM_EXIT_SAVE_VMX_PREEMPTION_TIMER depends on >> > PIN_BASED_VMX_PREEMPTION_TIMER. >> PIN_BASED_VMX_PREEMPTION_TIMER is an >> > independent feature >> > >> >> You remind that this piece of codes can write like this: >> >> if (!(nested_vmx_pin_based_ctls_high & >> >> PIN_BASED_VMX_PREEMPTION_TIMER) || >> >> !(nested_vmx_exit_ctls_high & >> >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) { >> >> nested_vmx_exit_ctls_high >> >> &=(~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER); >> >> nested_vmx_pinbased_ctls_high &= >> >> (~PIN_BASED_VMX_PREEMPTION_TIMER); >> >> } >> >> >> >> This may reflect the logic I describe above that these two flags >> >> should support simultaneously, and brings less confusion. >> >> > >> >> > BTW: I don't see nested_vmx_setup_ctls_msrs() considers the >> >> > hardware's >> >> capability when expose those vmx features(not just preemption timer) to L1. >> >> The codes just above here, when setting pinbased control for nested >> >> vmx, it firstly "rdmsr MSR_IA32_VMX_PINBASED_CTLS", then use this to >> >> mask the features hardware not support. So does other control fields. >> >> > >> > Yes, I saw it. >> > >> >> >> nested_vmx_exit_ctls_high |= >> >> >> (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | >> >> >> >> VM_EXIT_LOAD_IA32_EFER); >> >> >> >> >> >> @@ -6707,6 +6714,23 @@ static void vmx_get_exit_info(struct >> >> >> kvm_vcpu *vcpu, u64 *info1, u64 *info2) >> >> >> *info2 = vmcs_read32(VM_EXIT_INTR_INFO); } >> >> >> >> >> >> +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu) { >> >> >> + u64 delta_tsc_l1; >> >> >> + u32 preempt_val_l1, preempt_val_l2, preempt_scale; >> >> >> + >> >> >> + preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) & >> >> >> + >> >> MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE; >> >> >> + preempt_val_l2 = >> vmcs_read32(VMX_PREEMPTION_TIMER_VALUE); >> >> >> + delta_tsc_l1 = kvm_x86_ops->read_l1_tsc(vcpu, >> >> >> + native_read_tsc()) - >> vcpu->arch.last_guest_tsc; >> >> >> + preempt_val_l1 = delta_tsc_l1 >> preempt_scale; >> >> >> + if (preempt_val_l2 - preempt_val_l1 < 0) >> >> >> + preempt_val_l2 = 0; >> >> >> + else >> >> >> + preempt_val_l2 -= preempt_val_l1; >> >> >> + vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, >> >> preempt_val_l2); } >> >> >> /* >> >> >> * The guest has exited. See if we can fix it or if we need userspace >> >> >> * assistance. >> >> >> @@ -6716,6 +6740,7 @@ static int vmx_handle_exit(struct kvm_vcpu >> >> *vcpu) >> >> >> struct vcpu_vmx *vmx = to_vmx(vcpu); >> >> >> u32 exit_reason = vmx->exit_reason; >> >> >> u32 vectoring_info = vmx->idt_vectoring_info; >> >> >> + int ret; >> >> >> >> >> >> /* If guest state is invalid, start emulating */ >> >> >> if (vmx->emulation_required) @@ -6795,12 +6820,15 @@ static >> >> >> int vmx_handle_exit(struct kvm_vcpu >> >> >> *vcpu) >> >> >> >> >> >> if (exit_reason < kvm_vmx_max_exit_handlers >> >> >> && kvm_vmx_exit_handlers[exit_reason]) >> >> >> - return kvm_vmx_exit_handlers[exit_reason](vcpu); >> >> >> + ret = kvm_vmx_exit_handlers[exit_reason](vcpu); >> >> >> else { >> >> >> vcpu->run->exit_reason = KVM_EXIT_UNKNOWN; >> >> >> vcpu->run->hw.hardware_exit_reason = exit_reason; >> >> >> + ret = 0; >> >> >> } >> >> >> - return 0; >> >> >> + if (is_guest_mode(vcpu)) >> >> >> + nested_adjust_preemption_timer(vcpu); >> >> > Move this forward to avoid the changes for ret. >> >> The previous codes simply "return >> >> kvm_vmx_exit_handlers[exit_reason](vcpu);", which may also consumes >> >> CPU times. So put "nested_adjust_preemption_timer" ahead may cause >> >> the statistics inaccuracy. >> > Then you should put it before vmentry. Here still far from the point of >> vmentry. >> So where is the actual point of vmentry? I'm not quite familiar with that piece >> of codes. > vmx_vcpu_run should be the right place. > >> > >> >> >> + return ret; >> >> >> } >> >> >> >> >> >> static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, >> >> >> int >> >> >> irr) @@ >> >> >> -7518,6 +7546,7 @@ static void prepare_vmcs02(struct kvm_vcpu >> >> >> *vcpu, struct vmcs12 *vmcs12) { >> >> >> struct vcpu_vmx *vmx = to_vmx(vcpu); >> >> >> u32 exec_control; >> >> >> + u32 exit_control; >> >> >> >> >> >> vmcs_write16(GUEST_ES_SELECTOR, >> vmcs12->guest_es_selector); >> >> >> vmcs_write16(GUEST_CS_SELECTOR, >> vmcs12->guest_cs_selector); >> >> @@ >> >> >> -7691,7 +7720,10 @@ static void prepare_vmcs02(struct kvm_vcpu >> >> >> *vcpu, struct vmcs12 *vmcs12) >> >> >> * we should use its exit controls. Note that >> >> VM_EXIT_LOAD_IA32_EFER >> >> >> * bits are further modified by vmx_set_efer() below. >> >> >> */ >> >> >> - vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl); >> >> >> + exit_control = vmcs_config.vmexit_ctrl; >> >> >> + if (vmcs12->pin_based_vm_exec_control & >> >> >> PIN_BASED_VMX_PREEMPTION_TIMER) >> >> >> + exit_control |= >> >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER; >> >> >> + vmcs_write32(VM_EXIT_CONTROLS, exit_control); >> >> > And here should be problem if host doesn't support >> >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER. >> >> Nested vmx does check the hardware support of these features in >> >> "nested_vmx_setup_ctls_msrs", see my comments above. >> > But here you don't check it. You just set it unconditionally. >> What we have checked in nested_vmx_setup_ctls_msrs() confirms that only if >> these two features supported by hardware, nested vmx can enable preemption >> timer. That is to say, if L2 can set PIN_BASED_VMX_PREEMPTION_TIMER >> without failure, hardware must support >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER, which should guarantee in >> nested_vmx_setup_ctls_msrs (). (This is also why I write code like that in >> nested_vmx_setup_ctls_msrs ()). > I see your point. But the fact is nested_vmx_setup_ctls_msrs s an independent feature. So your previous assumption is wrong. The hardware may support PIN_BASED_VMX_PREEMPTION_TIMER without VM_EXIT_SAVE_VMX_PREEMPTION_TIMER I know what you mean, but see the changes of nested_vmx_setup_ctls_msrs() in this patch. "nested_vmx_setup_ctls_msrs()" is used to advertise features to nested VM. The changes to this function confirms that nested guest can set PIN_BASED_VMX_PREEMPTION_TIMER only if hardware support both PIN_BASED_VMX_PREEMPTION_TIMER and VM_EXIT_SAVE_VMX_PREEMPTION_TIMER. So I don't need to check here because if VM_EXIT_SAVE_VMX_PREEMPTION_TIMER is not supported by hardware, both features will not exploit to nested VM. Nested VM cannot vmenter if they set any, so they of course can't reach here. > >> > >> >> > >> >> >> >> >> >> /* vmcs12's VM_ENTRY_LOAD_IA32_EFER and >> >> VM_ENTRY_IA32E_MODE are >> >> >> * emulated by vmx_set_efer(), below. >> >> >> @@ -8090,6 +8122,17 @@ static void prepare_vmcs12(struct kvm_vcpu >> >> >> *vcpu, struct vmcs12 *vmcs12) >> >> >> vmcs12->guest_pending_dbg_exceptions = >> >> >> vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS); >> >> >> >> >> >> + if (vmcs12->pin_based_vm_exec_control & >> >> >> + PIN_BASED_VMX_PREEMPTION_TIMER) { >> >> >> + if (vmcs12->vm_exit_controls & >> >> >> + >> >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER) >> >> >> + vmcs12->vmx_preemption_timer_value = >> >> >> + >> >> vmcs_read32(VMX_PREEMPTION_TIMER_VALUE); >> >> >> + else >> >> >> + >> >> vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, >> >> >> + >> >> >> + vmcs12->vmx_preemption_timer_value); >> >> > Why write it to vmcs directly if >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER >> >> not set? >> >> Yes, writing is needless here since vmcs02 will be re-prepared via >> >> prepare_vmcs02() when L1->L2. This function just save information >> >> needed, vmcs_write is useless. >> >> >> > I don't sure I am seeing your point. Per my point, even the >> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER is not setting, you still need to >> save the value to vmcs12->vmx_preemption_timer_value. Or else, >> prepare_vmcs02 cannot get the right value. >> If L2 doesn't enable VM_EXIT_SAVE_VMX_PREEMPTION_TIMER, its value will >> be reset to what we set initially. This function is called due to >> L2->L1, so we should emulate L2's real exit process here. What we have >> done in other parts is to handle vmexit of L2->L0->L2, which is not the "real" L2 >> vmexit. > I am not describing it clearly. I mean if VM_EXIT_SAVE_VMX_PREEMPTION_TIMER is not set, then prepare_vmcs02() always write the initial value which means vmx_preemption_timer_value to vmcs02 before L2 running. So the vmcs wirte here is redundant. Am I wrong? Yes, I get your point and I meant the same, vmcs_write is redundant. I may confuse you by my poor description. ;) Arthur > >> >> Arthur >> > >> >> Arthur >> >> > >> >> >> + } >> >> >> + >> >> >> /* >> >> >> * In some cases (usually, nested EPT), L2 is allowed to change >> its >> >> >> * own CR3 without exiting. If it has changed it, we must keep it. >> >> >> -- >> >> >> 1.7.9.5 >> >> >> >> >> >> -- >> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in >> >> >> the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo >> >> >> info at http://vger.kernel.org/majordomo-info.html >> >> > >> >> > Best regards, >> >> > Yang >> >> > >> >> > >> >> -- >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in the >> >> body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at >> >> http://vger.kernel.org/majordomo-info.html >> > >> > Best regards, >> > Yang >> > >> > > > Best regards, > Yang > > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html