On 25/05/2016 00:27, Yunhong Jiang wrote:
> From: Yunhong Jiang <yunhong.jiang@xxxxxxxxx>
> 
> Utilizing the VMX preemption timer for tsc deadline timer
> virtualization. The VMX preemption timer is armed when the vCPU is
> running, and a VMExit will happen if the virtual TSC deadline timer
> expires.
> 
> When the vCPU thread is scheduled out, the tsc deadline timer
> virtualization will be switched to use the current solution, i.e. use
> the timer for it. It's switched back to the VMX preemption timer when
> the vCPU thread is scheduled in.
> 
> This solution avoids the complexity of the OS's hrtimer system, and
> also the host timer interrupt handling cost, replacing them with a
> preemption_timer VMexit. It fits well for some NFV usage scenarios,
> e.g. when the vCPU is bound to a pCPU and the pCPU is isolated, or
> some similar scenario.
> 
> However, it possibly has an impact if the vCPU thread is scheduled
> in/out very frequently, because it switches from/to the hrtimer
> emulation a lot.
> 
> Signed-off-by: Yunhong Jiang <yunhong.jiang@xxxxxxxxx>
> ---
>  arch/x86/kvm/lapic.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>  arch/x86/kvm/lapic.h | 11 ++++++
>  arch/x86/kvm/trace.h | 22 ++++++++++++
>  arch/x86/kvm/vmx.c   |  8 +++++
>  arch/x86/kvm/x86.c   |  4 +++
>  5 files changed, 139 insertions(+), 3 deletions(-)

Thanks, this looks very nice. As I mentioned in my reply to Marcelo, I
have some more changes in mind to reduce the overhead; this way we can
enable it unconditionally (on processors that do not have the erratum).
In particular:

1) While a virtual machine is scheduled out due to contention, you
don't care if the timer fires. You can fire the timer immediately
after the next vmentry; you only need to set a timer during HLT. So,
rather than sched_out/sched_in, you can start and stop the hrtimer in
vmx_pre_block and vmx_post_block. You probably should rename what is
now vmx_pre_block and vmx_post_block to something like pi_pre_block
and pi_post_block, so that vmx_pre_block and vmx_post_block look like

	if (pi_pre_block(vcpu))
		return 1;
	if (have preemption timer)
		kvm_lapic_switch_to_sw_timer(vcpu);

and

	if (have preemption timer)
		kvm_lapic_switch_to_hv_timer(vcpu);
	pi_post_block(vcpu);

respectively. You'll also need to check whether the deadlock in
switch_to_sw_lapic_timer is still possible. Hopefully not---if so,
simpler code! :)

2) If possible, I would like to remove kvm_lapic_arm_hv_timer.
Instead, make switch_to_hv_lapic_timer and start_apic_timer call
kvm_x86_ops->set_hv_timer, passing the desired *guest* TSC.
set_hv_timer can convert the guest TSC to a host TSC and save the
desired host TSC in struct vcpu_vmx. vcpu_run then checks whether
there is a TSC deadline and, if so, writes the delta into
VMX_PREEMPTION_TIMER_VALUE; set_hv_timer/cancel_hv_timer only write
to PIN_BASED_VM_EXEC_CONTROL.

Besides making vcpu_run faster, I think that with this change the
distinction between HV_TIMER_NEEDS_ARMING and HV_TIMER_ARMED
disappears, and struct kvm_lapic can simply have a bool
hv_timer_in_use. Again, it might actually end up making the code
simpler, especially the interface between lapic.c and vmx.c.

The hardest part is converting guest_tsc to host_tsc. From

	guest_tsc = (host_tsc * vcpu->arch.tsc_scaling_ratio
			>> kvm_tsc_scaling_ratio_frac_bits) + tsc_offset

you have

	host_tsc = ((unsigned __int128)(guest_tsc - tsc_offset)
			<< kvm_tsc_scaling_ratio_frac_bits)
		   / vcpu->arch.tsc_scaling_ratio;

... except that you need a division with a 128-bit dividend and a
64-bit divisor here, and there is no such thing in Linux. It's okay if
you restrict this to 64-bit hosts and use the divq assembly
instruction directly. (You also need to trap the divide overflow
exception; if there is an overflow, just disable the preemption
timer.)
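For the 64-bit case, here is an untested sketch of what the division
helper could look like. Instead of trapping the #DE it avoids the
overflow entirely by pre-checking the high quadword of the dividend
against the divisor; the name u64_shl_div_u64 is just a placeholder,
and it assumes 0 < shift < 64 (the fraction bits are 48 on VMX):

	/*
	 * Compute (a << shift) / divisor, where the shifted dividend
	 * may be up to 128 bits wide.  Returns 1 if the quotient does
	 * not fit in 64 bits, in which case the caller should simply
	 * disable the preemption timer.  Assumes 0 < shift < 64.
	 */
	static inline int u64_shl_div_u64(u64 a, unsigned int shift,
					  u64 divisor, u64 *result)
	{
		u64 low = a << shift, high = a >> (64 - shift);

		/* divq raises #DE if the quotient overflows 64 bits */
		if (high >= divisor)
			return 1;

		/* RDX:RAX / divisor -> quotient in RAX, remainder in RDX */
		asm("divq %2" : "=a" (low), "=d" (high)
			      : "rm" (divisor), "0" (low), "1" (high));

		*result = low;
		return 0;
	}

set_hv_timer would then do something like

	if (u64_shl_div_u64(guest_tsc - tsc_offset,
			    kvm_tsc_scaling_ratio_frac_bits,
			    vcpu->arch.tsc_scaling_ratio, &host_tsc))
		/* overflow, disable the preemption timer */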
On 32-bit systems, it's okay to force-disable
set_hv_timer/cancel_hv_timer, i.e. set them to NULL. And if you do
this, pre_block and post_block become

	if (pi_pre_block(vcpu))
		return 1;
	if (preemption timer bit set in VMCS)
		kvm_lapic_start_sw_timer(vcpu);

and

	if (preemption timer bit set in VMCS)
		kvm_lapic_cancel_sw_timer(vcpu);
	pi_post_block(vcpu);

There is no need to redo the preemption timer setup in post_block.

3) Regarding the erratum, you can use the code from
https://www.virtualbox.org/svn/vbox/trunk/src/VBox/VMM/VMMR0/HMR0.cpp
(function hmR0InitIntelIsSubjectToVmxPreemptionTimerErratum... please
rename it to something else! :)).
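That check boils down to comparing the CPUID family/model/stepping
signature against a list of affected processors. A rough sketch of the
shape---the signature values below are only a partial sample from
memory (Nehalem/Westmere parts), so the authoritative list must be
taken from the VirtualBox function above:

	/*
	 * CPUID(1).EAX signatures subject to the VMX preemption timer
	 * erratum.  Partial list, for illustration only.
	 */
	static const u32 vmx_preemption_timer_broken_fms[] = {
		0x000106E5,	/* Nehalem */
		0x000206E6,	/* Nehalem-EX */
		0x00020652,	/* Westmere */
		0x00020655,	/* Westmere */
	};

	static bool cpu_has_broken_vmx_preemption_timer(void)
	{
		u32 eax = cpuid_eax(0x00000001);
		unsigned int i;

		/* Clear the reserved bits before comparing */
		eax &= ~(0x3U << 14 | 0xfU << 28);
		for (i = 0; i < ARRAY_SIZE(vmx_preemption_timer_broken_fms); i++)
			if (eax == vmx_preemption_timer_broken_fms[i])
				return true;

		return false;
	}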
Thanks,

Paolo

> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index f1cf8a5ede11..93679db7ce0f 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1313,7 +1313,7 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
>  		__delay(tsc_deadline - guest_tsc);
>  }
>  
> -static void start_sw_tscdeadline(struct kvm_lapic *apic)
> +static void start_sw_tscdeadline(struct kvm_lapic *apic, int no_expire)
>  {
>  	u64 guest_tsc, tscdeadline = apic->lapic_timer.tscdeadline;
>  	u64 ns = 0;
> @@ -1330,7 +1330,7 @@ static void start_sw_tscdeadline(struct kvm_lapic *apic)
>  
>  	now = apic->lapic_timer.timer.base->get_time();
>  	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> -	if (likely(tscdeadline > guest_tsc)) {
> +	if (no_expire || likely(tscdeadline > guest_tsc)) {
>  		ns = (tscdeadline - guest_tsc) * 1000000ULL;
>  		do_div(ns, this_tsc_khz);
>  		expire = ktime_add_ns(now, ns);
> @@ -1343,6 +1343,85 @@ static void start_sw_tscdeadline(struct kvm_lapic *apic)
>  	local_irq_restore(flags);
>  }
>  
> +void kvm_lapic_arm_hv_timer(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_lapic *apic = vcpu->arch.apic;
> +	u64 tscdeadline, guest_tsc;
> +
> +	if (apic->lapic_timer.hv_timer_state == HV_TIMER_NOT_USED)
> +		return;
> +
> +	tscdeadline = apic->lapic_timer.tscdeadline;
> +	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> +
> +	if (tscdeadline >= guest_tsc)
> +		kvm_x86_ops->set_hv_timer(vcpu, tscdeadline - guest_tsc);
> +	else
> +		kvm_x86_ops->set_hv_timer(vcpu, 0);
> +
> +	apic->lapic_timer.hv_timer_state = HV_TIMER_ARMED;
> +	trace_kvm_hv_timer_state(vcpu->vcpu_id,
> +				 apic->lapic_timer.hv_timer_state);
> +}
> +EXPORT_SYMBOL_GPL(kvm_lapic_arm_hv_timer);
> +
> +void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_lapic *apic = vcpu->arch.apic;
> +
> +	WARN_ON(apic->lapic_timer.hv_timer_state != HV_TIMER_ARMED);
> +	WARN_ON(swait_active(&vcpu->wq));
> +	kvm_x86_ops->cancel_hv_timer(vcpu);
> +	apic->lapic_timer.hv_timer_state = HV_TIMER_NOT_USED;
> +	apic_timer_expired(apic);
> +}
> +EXPORT_SYMBOL_GPL(kvm_lapic_expired_hv_timer);
> +
> +void switch_to_hv_lapic_timer(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_lapic *apic = vcpu->arch.apic;
> +
> +	WARN_ON(apic->lapic_timer.hv_timer_state != HV_TIMER_NOT_USED);
> +
> +	if (apic_lvtt_tscdeadline(apic) &&
> +	    !atomic_read(&apic->lapic_timer.pending)) {
> +		hrtimer_cancel(&apic->lapic_timer.timer);
> +		/* In case the timer triggered in above small window */
> +		if (!atomic_read(&apic->lapic_timer.pending)) {
> +			apic->lapic_timer.hv_timer_state =
> +				HV_TIMER_NEEDS_ARMING;
> +			trace_kvm_hv_timer_state(vcpu->vcpu_id,
> +				apic->lapic_timer.hv_timer_state);
> +		}
> +	}
> +}
> +EXPORT_SYMBOL_GPL(switch_to_hv_lapic_timer);
> +
> +void switch_to_sw_lapic_timer(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_lapic *apic = vcpu->arch.apic;
> +
> +	/* Possibly the TSC deadline timer is not enabled yet */
> +	if (apic->lapic_timer.hv_timer_state == HV_TIMER_NOT_USED)
> +		return;
> +
> +	if (apic->lapic_timer.hv_timer_state == HV_TIMER_ARMED)
> +		kvm_x86_ops->cancel_hv_timer(vcpu);
> +	apic->lapic_timer.hv_timer_state = HV_TIMER_NOT_USED;
> +
> +	if (atomic_read(&apic->lapic_timer.pending))
> +		return;
> +
> +	/*
> +	 * Don't trigger the apic_timer_expired() for deadlock,
> +	 * because the swake_up() from apic_timer_expired() will
> +	 * try to get the run queue lock, which has been held here
> +	 * since we are in context switch procedure already.
> +	 */
> +	start_sw_tscdeadline(apic, 1);
> +}
> +EXPORT_SYMBOL_GPL(switch_to_sw_lapic_timer);
> +
>  static void start_apic_timer(struct kvm_lapic *apic)
>  {
>  	ktime_t now;
> @@ -1389,7 +1468,19 @@ static void start_apic_timer(struct kvm_lapic *apic)
>  			ktime_to_ns(ktime_add_ns(now,
>  				apic->lapic_timer.period)));
>  	} else if (apic_lvtt_tscdeadline(apic)) {
> -		start_sw_tscdeadline(apic);
> +		/* lapic timer in tsc deadline mode */
> +		if (kvm_x86_ops->set_hv_timer) {
> +			if (unlikely(!apic->lapic_timer.tscdeadline ||
> +				     !apic->vcpu->arch.virtual_tsc_khz))
> +				return;
> +
> +			/* Expired timer will be checked on vcpu_run() */
> +			apic->lapic_timer.hv_timer_state =
> +				HV_TIMER_NEEDS_ARMING;
> +			trace_kvm_hv_timer_state(apic->vcpu->vcpu_id,
> +				apic->lapic_timer.hv_timer_state);
> +		} else
> +			start_sw_tscdeadline(apic, 0);
>  	}
>  }
>  
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 891c6da7d4aa..dc4fd8eea04d 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -12,6 +12,12 @@
>  #define KVM_APIC_SHORT_MASK 0xc0000
>  #define KVM_APIC_DEST_MASK 0x800
>  
> +enum {
> +	HV_TIMER_NOT_USED,
> +	HV_TIMER_NEEDS_ARMING,
> +	HV_TIMER_ARMED,
> +};
> +
>  struct kvm_timer {
>  	struct hrtimer timer;
>  	s64 period; /* unit: ns */
> @@ -20,6 +26,7 @@ struct kvm_timer {
>  	u64 tscdeadline;
>  	u64 expired_tscdeadline;
>  	atomic_t pending; /* accumulated triggered timers */
> +	int hv_timer_state;
>  };
>  
>  struct kvm_lapic {
> @@ -212,4 +219,8 @@ bool kvm_intr_is_single_vcpu_fast(struct kvm *kvm, struct kvm_lapic_irq *irq,
>  			struct kvm_vcpu **dest_vcpu);
>  int kvm_vector_to_index(u32 vector, u32 dest_vcpus,
>  			const unsigned long *bitmap, u32 bitmap_size);
> +void switch_to_sw_lapic_timer(struct kvm_vcpu *vcpu);
> +void switch_to_hv_lapic_timer(struct kvm_vcpu *vcpu);
> +void kvm_lapic_arm_hv_timer(struct kvm_vcpu *vcpu);
> +void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu);
>  #endif
> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> index 8de925031b5c..8e1b5e9c2e78 100644
> --- a/arch/x86/kvm/trace.h
> +++ b/arch/x86/kvm/trace.h
> @@ -6,6 +6,7 @@
>  #include <asm/svm.h>
>  #include <asm/clocksource.h>
>  #include <asm/pvclock-abi.h>
> +#include <lapic.h>
>  
>  #undef TRACE_SYSTEM
>  #define TRACE_SYSTEM kvm
> @@ -1348,6 +1349,27 @@ TRACE_EVENT(kvm_avic_unaccelerated_access,
>  		__entry->vec)
>  );
>  
> +#define kvm_trace_symbol_hv_timer		\
> +	{HV_TIMER_NOT_USED, "no"},		\
> +	{HV_TIMER_NEEDS_ARMING, "need_arming"},	\
> +	{HV_TIMER_ARMED, "armed"}
> +
> +TRACE_EVENT(kvm_hv_timer_state,
> +	TP_PROTO(unsigned int vcpu_id, unsigned int hv_timer_state),
> +	TP_ARGS(vcpu_id, hv_timer_state),
> +	TP_STRUCT__entry(
> +		__field(unsigned int, vcpu_id)
> +		__field(unsigned int, hv_timer_state)
> +	),
> +	TP_fast_assign(
> +		__entry->vcpu_id = vcpu_id;
> +		__entry->hv_timer_state = hv_timer_state;
> +	),
> +	TP_printk("vcpu_id %x hv_timer %s\n",
> +		__entry->vcpu_id,
> +		__print_symbolic(__entry->hv_timer_state,
> +			kvm_trace_symbol_hv_timer))
> +	);
>  #endif /* _TRACE_KVM_H */
>  
>  #undef TRACE_INCLUDE_PATH
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 2b29afa61715..dc0b8cae02b8 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -7576,6 +7576,11 @@ static int handle_pcommit(struct kvm_vcpu *vcpu)
>  	return 1;
>  }
>  
> +static int handle_preemption_timer(struct kvm_vcpu *vcpu)
> +{
> +	kvm_lapic_expired_hv_timer(vcpu);
> +	return 1;
> +}
>  /*
>   * The exit handlers return 1 if the exit was handled fully and guest execution
>   * may resume. Otherwise they set the kvm_run parameter to indicate what needs
> @@ -7627,6 +7632,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
>  	[EXIT_REASON_XRSTORS]                 = handle_xrstors,
>  	[EXIT_REASON_PML_FULL]                = handle_pml_full,
>  	[EXIT_REASON_PCOMMIT]                 = handle_pcommit,
> +	[EXIT_REASON_PREEMPTION_TIMER]        = handle_preemption_timer,
>  };
>  
>  static const int kvm_vmx_max_exit_handlers =
> @@ -8678,6 +8684,8 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  	if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
>  		vmx_set_interrupt_shadow(vcpu, 0);
>  
> +	kvm_lapic_arm_hv_timer(vcpu);
> +
>  	if (vmx->guest_pkru_valid)
>  		__write_pkru(vmx->guest_pkru);
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 269d576520ba..f4b50608568a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7724,11 +7724,15 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
>  
>  void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
>  {
> +	if (kvm_x86_ops->set_hv_timer)
> +		switch_to_hv_lapic_timer(vcpu);
>  	kvm_x86_ops->sched_in(vcpu, cpu);
>  }
>  
>  void kvm_arch_sched_out(struct kvm_vcpu *vcpu)
>  {
> +	if (kvm_x86_ops->set_hv_timer)
> +		switch_to_sw_lapic_timer(vcpu);
>  }
>  
>  int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> --
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html