Re: [PATCH v3] KVM: nVMX: Fully support of nested VMX preemption timer

Arthur Chunqi Li <yzt356@xxxxxxxxx> · Thu, 5 Sep 2013 16:47:28 +0800



On Thu, Sep 5, 2013 at 3:45 PM, Zhang, Yang Z <yang.z.zhang@xxxxxxxxx> wrote:
> Arthur Chunqi Li wrote on 2013-09-04:
>> This patch contains the following two changes:
>> 1. Fix a bug in nested preemption timer support. If a vmexit L2->L0 occurs
>> for a reason not emulated by L1, the preemption timer value should be saved
>> on that exit.
>> 2. Add support for the "Save VMX-preemption timer value" VM-exit control to
>> nVMX.
>>
>> With this patch, nested VMX preemption timer features are fully supported.
>>
>> Signed-off-by: Arthur Chunqi Li <yzt356@xxxxxxxxx>
>> ---
>> This series depends on queue.
>>
>>  arch/x86/include/uapi/asm/msr-index.h |    1 +
>>  arch/x86/kvm/vmx.c                    |   51
>> ++++++++++++++++++++++++++++++---
>>  2 files changed, 48 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/include/uapi/asm/msr-index.h
>> b/arch/x86/include/uapi/asm/msr-index.h
>> index bb04650..b93e09a 100644
>> --- a/arch/x86/include/uapi/asm/msr-index.h
>> +++ b/arch/x86/include/uapi/asm/msr-index.h
>> @@ -536,6 +536,7 @@
>>
>>  /* MSR_IA32_VMX_MISC bits */
>>  #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL << 29)
>> +#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE   0x1F
>>  /* AMD-V MSRs */
>>
>>  #define MSR_VM_CR                       0xc0010114
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 1f1da43..870caa8 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -2204,7 +2204,14 @@ static __init void nested_vmx_setup_ctls_msrs(void)
>>  #ifdef CONFIG_X86_64
>>               VM_EXIT_HOST_ADDR_SPACE_SIZE |
>>  #endif
>> -             VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT;
>> +             VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
>> +             VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;
>> +     if (!(nested_vmx_pinbased_ctls_high &
>> PIN_BASED_VMX_PREEMPTION_TIMER))
>> +             nested_vmx_exit_ctls_high &=
>> +                     (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
>> +     if (!(nested_vmx_exit_ctls_high &
>> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER))
>> +             nested_vmx_pinbased_ctls_high &=
>> +                     (~PIN_BASED_VMX_PREEMPTION_TIMER);
> The following logic is clearer:
> if(nested_vmx_pinbased_ctls_high & PIN_BASED_VMX_PREEMPTION_TIMER)
>         nested_vmx_exit_ctls_high |= VM_EXIT_SAVE_VMX_PREEMPTION_TIMER
My concern is that this logic is wrong if the CPU supports
PIN_BASED_VMX_PREEMPTION_TIMER but not
VM_EXIT_SAVE_VMX_PREEMPTION_TIMER, though I don't know whether that
ever occurs in practice. The code above reads the MSRs and keeps only
the features the hardware supports; here I just check that these two
features are supported at the same time.

Your comment reminds me that this piece of code could be written like this:
if (!(nested_vmx_pinbased_ctls_high & PIN_BASED_VMX_PREEMPTION_TIMER) ||
    !(nested_vmx_exit_ctls_high & VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) {
        nested_vmx_exit_ctls_high &= (~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
        nested_vmx_pinbased_ctls_high &= (~PIN_BASED_VMX_PREEMPTION_TIMER);
}

This reflects the logic I described above, that the two flags must be
supported together, and it is less confusing.
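
To make the intent concrete, here is a minimal standalone sketch of that
coupling. It is not the patch itself: the struct and the function name are
hypothetical stand-ins for the two nested_vmx_*_ctls_high variables, and
the bit values are what I believe the kernel headers define, so treat them
as assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit values; check them against the kernel's vmx.h. */
#define PIN_BASED_VMX_PREEMPTION_TIMER    0x00000040u
#define VM_EXIT_SAVE_VMX_PREEMPTION_TIMER 0x00400000u

/* Hypothetical stand-in for the two nested_vmx_*_ctls_high variables. */
struct nested_ctls {
	uint32_t pinbased_high;
	uint32_t exit_high;
};

/*
 * Expose the preemption timer to L1 only when both controls are
 * available. If either one is missing, clear both.
 */
static inline void couple_preemption_timer_ctls(struct nested_ctls *c)
{
	assert(c != NULL);
	if (!(c->pinbased_high & PIN_BASED_VMX_PREEMPTION_TIMER) ||
	    !(c->exit_high & VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)) {
		c->pinbased_high &= ~PIN_BASED_VMX_PREEMPTION_TIMER;
		c->exit_high &= ~VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;
	}
}
```

Other bits in either word are left untouched, so the check can run after
the rest of the control setup.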
>
> BTW: I don't see nested_vmx_setup_ctls_msrs() taking the hardware's capabilities into account when it exposes those VMX features (not just the preemption timer) to L1.
The code just above this hunk does that: when setting up the pin-based
controls for nested VMX, it first does "rdmsr MSR_IA32_VMX_PINBASED_CTLS"
and then uses the result to mask out the features the hardware does not
support. The other control fields are handled the same way.
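
As a rough illustration of that masking (a sketch only, with a
hypothetical helper name, not the code in vmx.c): the high 32 bits of a
VMX control capability MSR report which control bits are allowed to be 1,
so ANDing the wanted features with that word drops whatever the hardware
lacks.

```c
#include <stdint.h>

/*
 * Keep only the wanted control bits that the capability MSR allows to
 * be set. cap_msr is the raw 64-bit value read from a control
 * capability MSR such as MSR_IA32_VMX_PINBASED_CTLS; its high word
 * holds the allowed-1 settings.
 */
static inline uint32_t allowed1_mask(uint64_t cap_msr, uint32_t wanted)
{
	uint32_t allowed1 = (uint32_t)(cap_msr >> 32);

	return allowed1 & wanted;
}
```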
>
>>       nested_vmx_exit_ctls_high |=
>> (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
>>                                     VM_EXIT_LOAD_IA32_EFER);
>>
>> @@ -6707,6 +6714,23 @@ static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2)
>>       *info2 = vmcs_read32(VM_EXIT_INTR_INFO);
>>  }
>>
>> +static void nested_adjust_preemption_timer(struct kvm_vcpu *vcpu)
>> +{
>> +     u64 delta_tsc_l1;
>> +     u32 preempt_val_l1, preempt_val_l2, preempt_scale;
>> +
>> +     preempt_scale = native_read_msr(MSR_IA32_VMX_MISC) &
>> +                     MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE;
>> +     preempt_val_l2 = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE);
>> +     delta_tsc_l1 = kvm_x86_ops->read_l1_tsc(vcpu,
>> +                     native_read_tsc()) - vcpu->arch.last_guest_tsc;
>> +     preempt_val_l1 = delta_tsc_l1 >> preempt_scale;
>> +     if (preempt_val_l2 - preempt_val_l1 < 0)
>> +             preempt_val_l2 = 0;
>> +     else
>> +             preempt_val_l2 -= preempt_val_l1;
>> +     vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, preempt_val_l2);
>> +}
>>  /*
>>   * The guest has exited.  See if we can fix it or if we need userspace
>>   * assistance.
>> @@ -6716,6 +6740,7 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
>>       struct vcpu_vmx *vmx = to_vmx(vcpu);
>>       u32 exit_reason = vmx->exit_reason;
>>       u32 vectoring_info = vmx->idt_vectoring_info;
>> +     int ret;
>>
>>       /* If guest state is invalid, start emulating */
>>       if (vmx->emulation_required)
>> @@ -6795,12 +6820,15 @@ static int vmx_handle_exit(struct kvm_vcpu
>> *vcpu)
>>
>>       if (exit_reason < kvm_vmx_max_exit_handlers
>>           && kvm_vmx_exit_handlers[exit_reason])
>> -             return kvm_vmx_exit_handlers[exit_reason](vcpu);
>> +             ret = kvm_vmx_exit_handlers[exit_reason](vcpu);
>>       else {
>>               vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
>>               vcpu->run->hw.hardware_exit_reason = exit_reason;
>> +             ret = 0;
>>       }
>> -     return 0;
>> +     if (is_guest_mode(vcpu))
>> +             nested_adjust_preemption_timer(vcpu);
> Move this forward to avoid the changes for ret.
The previous code simply did "return
kvm_vmx_exit_handlers[exit_reason](vcpu);", and the handler itself also
consumes CPU time. Calling nested_adjust_preemption_timer() before the
handler would leave that time unaccounted for, so the adjusted timer
value would be inaccurate.
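
For reference, a standalone sketch of the adjustment arithmetic. This is
not the patch code; the function and parameter names are hypothetical. It
assumes the timer counts down once every 2^scale TSC ticks, with scale
taken from the low 5 bits of MSR_IA32_VMX_MISC. One thing it makes
visible: with unsigned operands a test of the form "a - b < 0" can never
be true, so the sketch compares before subtracting.

```c
#include <assert.h>
#include <stdint.h>

/* Low 5 bits of MSR_IA32_VMX_MISC: TSC-to-timer shift. */
#define PREEMPTION_TIMER_SCALE_MASK 0x1Fu

/*
 * Return the L2 timer value after charging delta_tsc elapsed TSC
 * ticks to it, clamped at zero.
 */
static inline uint32_t adjust_preemption_timer(uint32_t timer_l2,
					       uint64_t delta_tsc,
					       uint64_t vmx_misc)
{
	uint32_t scale = (uint32_t)(vmx_misc & PREEMPTION_TIMER_SCALE_MASK);
	uint64_t elapsed = delta_tsc >> scale;

	assert(scale < 32);
	/* Compare first: an unsigned difference is never negative. */
	if (elapsed >= timer_l2)
		return 0;
	return timer_l2 - (uint32_t)elapsed;
}
```

The elapsed count stays 64-bit until after the comparison, so a very large
TSC delta cannot wrap around into a small value.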
>> +     return ret;
>>  }
>>
>>  static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
>> @@ -7518,6 +7546,7 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>>  {
>>       struct vcpu_vmx *vmx = to_vmx(vcpu);
>>       u32 exec_control;
>> +     u32 exit_control;
>>
>>       vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
>>       vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
>> @@ -7691,7 +7720,10 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>>        * we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER
>>        * bits are further modified by vmx_set_efer() below.
>>        */
>> -     vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
>> +     exit_control = vmcs_config.vmexit_ctrl;
>> +     if (vmcs12->pin_based_vm_exec_control &
>> PIN_BASED_VMX_PREEMPTION_TIMER)
>> +             exit_control |= VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;
>> +     vmcs_write32(VM_EXIT_CONTROLS, exit_control);
> And this will be a problem if the host doesn't support VM_EXIT_SAVE_VMX_PREEMPTION_TIMER.
Nested VMX does check hardware support for these features in
nested_vmx_setup_ctls_msrs(); see my comments above.
>
>>
>>       /* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE
>> are
>>        * emulated by vmx_set_efer(), below.
>> @@ -8090,6 +8122,17 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu,
>> struct vmcs12 *vmcs12)
>>       vmcs12->guest_pending_dbg_exceptions =
>>               vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
>>
>> +     if (vmcs12->pin_based_vm_exec_control &
>> +                     PIN_BASED_VMX_PREEMPTION_TIMER) {
>> +             if (vmcs12->vm_exit_controls &
>> +                             VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)
>> +                     vmcs12->vmx_preemption_timer_value =
>> +                             vmcs_read32(VMX_PREEMPTION_TIMER_VALUE);
>> +             else
>> +                     vmcs_write32(VMX_PREEMPTION_TIMER_VALUE,
>> +                                     vmcs12->vmx_preemption_timer_value);
> Why write it to the VMCS directly if VM_EXIT_SAVE_VMX_PREEMPTION_TIMER is not set?
Yes, the write is unnecessary here, since vmcs02 is prepared again via
prepare_vmcs02() on the next L1->L2 entry. This function only needs to
save information, so the vmcs_write32() can be dropped.
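
A sketch of what the save-only version could look like, outside the
kernel. The struct and function names are hypothetical, the bit values
are assumed as before, and hw_timer stands for the value that
vmcs_read32(VMX_PREEMPTION_TIMER_VALUE) would return.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit values; check them against the kernel's vmx.h. */
#define PIN_BASED_VMX_PREEMPTION_TIMER    0x00000040u
#define VM_EXIT_SAVE_VMX_PREEMPTION_TIMER 0x00400000u

/* Hypothetical subset of the vmcs12 fields involved. */
struct vmcs12_timer {
	uint32_t pin_based_vm_exec_control;
	uint32_t vm_exit_controls;
	uint32_t vmx_preemption_timer_value;
};

/*
 * Save the remaining timer value into vmcs12 only when L1 enabled the
 * timer and asked for it to be saved on exit. Otherwise vmcs12 keeps
 * the value L1 programmed, and nothing is written back to hardware.
 */
static inline void save_preemption_timer(struct vmcs12_timer *v,
					 uint32_t hw_timer)
{
	assert(v != NULL);
	if ((v->pin_based_vm_exec_control & PIN_BASED_VMX_PREEMPTION_TIMER) &&
	    (v->vm_exit_controls & VM_EXIT_SAVE_VMX_PREEMPTION_TIMER))
		v->vmx_preemption_timer_value = hw_timer;
}
```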

Arthur
>
>> +     }
>> +
>>       /*
>>        * In some cases (usually, nested EPT), L2 is allowed to change its
>>        * own CR3 without exiting. If it has changed it, we must keep it.
>> --
>> 1.7.9.5
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a
>> message to majordomo@xxxxxxxxxxxxxxx More majordomo info at
>> http://vger.kernel.org/majordomo-info.html
>
> Best regards,
> Yang
>
>