On 2013-08-25 10:41, Arthur Chunqi Li wrote: > On Sun, Aug 25, 2013 at 4:18 PM, Abel Gordon <ABELG@xxxxxxxxxx> wrote: >> >> >> kvm-owner@xxxxxxxxxxxxxxx wrote on 25/08/2013 10:54:13 AM: >> >>> From: Jan Kiszka <jan.kiszka@xxxxxx> >>> To: Abel Gordon/Haifa/IBM@IBMIL, >>> Cc: gleb@xxxxxxxxxx, kvm <kvm@xxxxxxxxxxxxxxx>, pbonzini@xxxxxxxxxx, >>> "李春奇 <Arthur Chunqi Li>" <yzt356@xxxxxxxxx> >>> Date: 25/08/2013 10:54 AM >>> Subject: Re: [PATCH] KVM: nVMX: Fully support of nested VMX preemption >> timer >>> Sent by: kvm-owner@xxxxxxxxxxxxxxx >>> >>> On 2013-08-25 09:50, Abel Gordon wrote: >>>> >>>> >>>> kvm-owner@xxxxxxxxxxxxxxx wrote on 25/08/2013 10:43:12 AM: >>>> >>>>> From: Jan Kiszka <jan.kiszka@xxxxxx> >>>>> To: Abel Gordon/Haifa/IBM@IBMIL, >>>>> Cc: gleb@xxxxxxxxxx, kvm@xxxxxxxxxxxxxxx, kvm-owner@xxxxxxxxxxxxxxx, >>>>> pbonzini@xxxxxxxxxx, "李春奇 <Arthur Chunqi Li>" <yzt356@xxxxxxxxx> >>>>> Date: 25/08/2013 10:43 AM >>>>> Subject: Re: [PATCH] KVM: nVMX: Fully support of nested VMX preemption >>>> timer >>>>> Sent by: kvm-owner@xxxxxxxxxxxxxxx >>>>> >>>>> On 2013-08-25 09:37, Abel Gordon wrote: >>>>>> >>>>>> >>>>>>> From: Jan Kiszka <jan.kiszka@xxxxxx> >>>>>>> To: "李春奇 <Arthur Chunqi Li>" <yzt356@xxxxxxxxx>, >>>>>>> Cc: kvm@xxxxxxxxxxxxxxx, gleb@xxxxxxxxxx, pbonzini@xxxxxxxxxx >>>>>>> Date: 25/08/2013 09:44 AM >>>>>>> Subject: Re: [PATCH] KVM: nVMX: Fully support of nested VMX >> preemption >>>>>> timer >>>>>>> Sent by: kvm-owner@xxxxxxxxxxxxxxx >>>>>>> >>>>>>> On 2013-08-24 20:44, root wrote: >>>>>>>> This patch contains the following two changes: >>>>>>>> 1. Fix the bug in nested preemption timer support. If vmexit L2->L0 >>>>>>>> with some reasons not emulated by L1, preemption timer value should >>>>>>>> be save in such exits. >>>>>>>> 2. Add support of "Save VMX-preemption timer value" VM-Exit >> controls >>>>>>>> to nVMX. >>>>>>>> >>>>>>>> With this patch, nested VMX preemption timer features are fully >>>>>>>> supported. >>>>>>>> >>>>>>>> Signed-off-by: Arthur Chunqi Li <yzt356@xxxxxxxxx> >>>>>>>> --- >>>>>> >>>>>>>> >>>>>>>> @@ -7578,9 +7579,14 @@ static void prepare_vmcs02(struct kvm_vcpu >>>>>>> *vcpu, struct vmcs12 *vmcs12) >>>>>>>> (vmcs_config.pin_based_exec_ctrl | >>>>>>>> vmcs12->pin_based_vm_exec_control)); >>>>>>>> >>>>>>>> - if (vmcs12->pin_based_vm_exec_control & >>>>>> PIN_BASED_VMX_PREEMPTION_TIMER) >>>>>>>> - vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, >>>>>>>> - vmcs12->vmx_preemption_timer_value); >>>>>>>> + if (vmcs12->pin_based_vm_exec_control & >>>>>>> PIN_BASED_VMX_PREEMPTION_TIMER) { >>>>>>>> + if (vmcs12->vm_exit_controls & >>>>>> VM_EXIT_SAVE_VMX_PREEMPTION_TIMER) >>>>>>>> + vmcs12->vmx_preemption_timer_value = >>>>>>>> + vmcs_read32(VMX_PREEMPTION_TIMER_VALUE); >>>>>>>> + else >>>>>>>> + vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, >>>>>>>> + vmcs12->vmx_preemption_timer_value); >>>>>>>> + } >>>>>>> >>>>>>> This is not correct. We still need to set the vmcs to >>>>>>> vmx_preemption_timer_value. The difference is that, on exit from L2, >>>>>>> vmx_preemption_timer_value has to be updated according to the saved >>>>>>> hardware state. The corresponding code is missing in your patch so >>>> far. >>>>>> >>>>>> I think something else maybe be missing here: assuming L0 handles >> exits >>>>>> for L2 without involving L1 (e.g. external interrupts or ept >>>> violations), >>>>>> then, we may spend some cycles in L0 handling these exits. Note L1 is >>>> not >>>>>> aware of these exits and from L1 perspective L2 was running on the >> CPU. >>>>>> That means that we may need to reduce these cycles spent at >>>>>> L0 from the preemtion timer or emulate a preemption timer exit to >>>>>> force a transition to L1 instead of resuming L2. >>>>> >>>>> That's precisely what the logic I described should achieve: reload the >>>>> value we saved on L2 exit on reentry. >>>> >>>> But don't you think we should also reduce the cycles spent at L0 from >> the >>>> preemption timer ? I mean, if we spent X cycles at L0 handling a L2 >> exit >>>> which was not forwarded to L1, then, before we resume L2, >>>> the preemption timer should be: (previous_value_on_exit - X). >>>> If (previous_value_on_exit - X) < 0, then we should force ("emulate") a >>>> preemption timer exit between L2 and L1. >>> >>> We ask the hardware to save the value of the preemption on L2 exit. This >>> value will be exposed to L1 (if it asked for saving as well) and/or be >>> written back to the hardware on L2 reenty (unless L1 had a chance to run >>> and modified it). So the time spent in L0 is implicitly subtracted. >> >> I think you are suggesting the following, please correct me if I am wrong. >> 1) L1 resumes L2 with preemption timer enabled >> 2) L0 emulates the resume/launch >> 3) L2 runs for Y cycles until an external interrupt occurs (Y < preemption >> timer specified by L1) >> 4) L0 saved the preemption timer (original value - Y) >> 5) L0 spends X cycles handling the external interrupt >> 6) L0 resumes L2 with preemption timer = original value - Y >> >> Note that in this case "X is ignored". >> >> I was suggesting to do the following: >> 6) If original value - Y - X > 0 then >> L0 resumes L2 with preemption timer = original value - Y - X >> else >> L0 emulates a L2->L1 preemption timer exit (resumes L1) > Yes, your description is right. But I'm also thinking about my > previous consideration, why should we consider such X cycles as what > L2 spent. For nested VMX. external interrupt is not provided by L1, it > is triggered from L0 and want to cause periodically exit to L1, L2 is > "accidentally injure" actually. Since these interrupts are not > generated from L1 and not attend to affect L2, these cycles should not > be treated as what L2 spent. Though these cycles are "spent" in view > of L1, but they should not be taken into consideration in nested VMX. > > For another example, if vcpu scheduled out when L0 handing such > interrupts and CPU does some other things then schedule this vcpu > again, these cycles of executing other processes should not be treated > as what L2 spent definitely. Think of your preemption timer test case: There you are indirectly comparing the timer value against the TSC by checking the a preemption timer exit happened after no more than n TSC cycles. But as the TSC L1 and L2 sees continued to tick while in L0, this test could now fail when we leave out the L0 cycles. An alternative would be to hide all L0 TSC cycles from the guest. But that's not the way KVM works, independent of the preemption timer case. BTW, you should use guest_read_tsc() on exit/entry of L2 in order to calculate the time spent in L0. This will ensure that potential tweaks of TSC_OFFSET that L0 might have applied in the meantime will be taken into account. Jan
Attachment:
signature.asc
Description: OpenPGP digital signature