On Wed, 25 May 2016 13:52:56 +0200
Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:

> 
> 
> On 25/05/2016 00:27, Yunhong Jiang wrote:
> > From: Yunhong Jiang <yunhong.jiang@xxxxxxxxx>
> > 
> > Utilize the VMX preemption timer for TSC deadline timer
> > virtualization. The VMX preemption timer is armed while the vCPU is
> > running, and a VMExit happens when the virtual TSC deadline timer
> > expires.
> > 
> > When the vCPU thread is scheduled out, the TSC deadline timer
> > virtualization is switched back to the current solution, i.e. the
> > hrtimer is used for it. It is switched back to the VMX preemption
> > timer when the vCPU thread is scheduled in again.
> > 
> > This solution avoids the complexity of the OS's hrtimer system and
> > the host timer interrupt handling cost, at the price of a
> > preemption_timer VMexit. It fits well for NFV-like usage scenarios,
> > where the vCPU is bound to a pCPU and the pCPU is isolated, or
> > similar setups.
> > 
> > However, it may have an impact if the vCPU thread is scheduled
> > in/out very frequently, because it then switches from/to the
> > hrtimer emulation a lot.
> > 
> > Signed-off-by: Yunhong Jiang <yunhong.jiang@xxxxxxxxx>
> > ---
> >  arch/x86/kvm/lapic.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> >  arch/x86/kvm/lapic.h | 11 ++++++
> >  arch/x86/kvm/trace.h | 22 ++++++++++++
> >  arch/x86/kvm/vmx.c   |  8 +++++
> >  arch/x86/kvm/x86.c   |  4 +++
> >  5 files changed, 139 insertions(+), 3 deletions(-)
> 
> Thanks, this looks very nice.
> 
> As I mentioned in my reply to Marcelo, I have some more changes in
> mind to reduce the overhead. This way we can enable it
> unconditionally (on processors that do not have the erratum). In
> particular:
> 
> 1) While a virtual machine is scheduled out due to contention, you
> don't care if the timer fires. You can fire the timer immediately
> after the next vmentry.

I'm not very sure this statement applies to all situations, but I
can't think of any exception. So sure, I will do it this way.

> You only need to set a timer during HLT.
> So, rather than sched_out/sched_in, you can start and stop the hrtimer
> in vmx_pre_block and vmx_post_block. You probably should rename what
> is now vmx_pre_block and vmx_post_block to something like
> pi_pre_block and pi_post_block, so that vmx_pre_block and
> vmx_post_block are like
> 
>         if (pi_pre_block(vcpu))
>                 return 1;
>         if (have preemption timer)
>                 kvm_lapic_switch_to_sw_timer(vcpu);
> 
> and
> 
>         if (have preemption timer)
>                 kvm_lapic_switch_to_hv_timer(vcpu);
>         pi_post_block(vcpu);
> 
> respectively.
> 
> You'll also need to check whether the deadlock in
> switch_to_sw_lapic_timer is still possible. Hopefully not---if so,
> simpler code! :)

Hmm, the deadlock should not happen here, but I will try and see.
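
Just to check that I understand the shape you want, a rough and
untested sketch (the pi_* names follow your renaming suggestion; I am
guessing that the "have preemption timer" test can simply be a check
of kvm_x86_ops->set_hv_timer, as in the x86.c hunk below):

static int vmx_pre_block(struct kvm_vcpu *vcpu)
{
        if (pi_pre_block(vcpu))
                return 1;

        /*
         * The preemption timer cannot fire while we are blocked in
         * HLT, so fall back to the hrtimer emulation for this period.
         */
        if (kvm_x86_ops->set_hv_timer)
                kvm_lapic_switch_to_sw_timer(vcpu);

        return 0;
}

static void vmx_post_block(struct kvm_vcpu *vcpu)
{
        /* About to go back under vmentry control, re-arm the hv timer. */
        if (kvm_x86_ops->set_hv_timer)
                kvm_lapic_switch_to_hv_timer(vcpu);

        pi_post_block(vcpu);
}

Is that roughly what you have in mind?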

> 2) if possible, I would like to remove kvm_lapic_arm_hv_timer.
> Instead, make switch_to_hv_lapic_timer and start_apic_timer call
> kvm_x86_ops->set_hv_timer, passing the desired *guest* TSC.
> set_hv_timer can convert the guest TSC to a host TSC and save the
> desired host TSC in struct vmx_vcpu. vcpu_run checks if there is a
> TSC deadline and, if so, writes the delta into
> VMX_PREEMPTION_TIMER_VALUE. set_hv_timer/cancel_hv_timer only write
> to PIN_BASED_VM_EXEC_CONTROL.
> 
> Besides making vcpu_run faster, I think that with this change the
> distinction between HV_TIMER_NEEDS_ARMING and HV_TIMER_ARMED
> disappears, and struct kvm_lapic can simply have a bool
> hv_timer_in_use. Again, it might actually end up making the code
> simpler, especially the interface between lapic.c and vmx.c.

Yes, this should save two VMCS accesses on vcpu_run(). Will do it
this way.

> The hardest part is converting guest_tsc to host_tsc. From
> 
>         guest_tsc = (host_tsc * vcpu->arch.tsc_scaling_ratio >>
>                      kvm_tsc_scaling_ratio_frac_bits) + tsc_offset
> 
> you have
> 
>         host_tsc = ((unsigned __int128)(guest_tsc - tsc_offset)
>                     << kvm_tsc_scaling_ratio_frac_bits)
>                    / vcpu->arch.tsc_scaling_ratio;
> 
> ... except that you need a division with a 128-bit dividend and a
> 64-bit divisor here, and there is no such thing in Linux. It's okay
> if you restrict this to 64-bit hosts and use the divq assembly
> instruction directly. (You also need to trap the divide overflow
> exception; if there is an overflow just disable the preemption
> timer). On 32-bit systems, it's okay to force-disable
> set_hv_timer/cancel_hv_timer, i.e. set them to NULL.

I'm scared by this conversion :) Sure, I will have a try. I first
need to check how scale_tsc works.
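
For the 64-bit-only conversion, a first, untested sketch of the
divide helper I have in mind (name and placement are provisional;
instead of trapping the divide-error exception, the sketch checks the
high half of the dividend first, which catches the same overflow
case):

/*
 * Divide (a << shift) by divisor, using a 128-bit dividend.  Returns
 * nonzero if the quotient does not fit in 64 bits; the caller then
 * simply does not use the preemption timer for this deadline.
 * Assumes 0 < shift < 64, which holds for
 * kvm_tsc_scaling_ratio_frac_bits.  64-bit hosts only.
 */
static inline int u64_shl_div_u64(u64 a, unsigned int shift,
                                  u64 divisor, u64 *result)
{
        u64 low = a << shift, high = a >> (64 - shift);

        /* divq raises #DE when the quotient overflows 64 bits. */
        if (high >= divisor)
                return 1;

        asm("divq %2" : "=a" (low), "=d" (high)
                      : "rm" (divisor), "0" (low), "1" (high));

        *result = low;  /* the remainder left in high is discarded */
        return 0;
}

set_hv_timer() would then call it as
u64_shl_div_u64(guest_tsc - tsc_offset, kvm_tsc_scaling_ratio_frac_bits,
vcpu->arch.tsc_scaling_ratio, &host_tsc), following your formula, and
on 32-bit hosts set_hv_timer/cancel_hv_timer stay NULL as you suggest.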

> And if you do this, pre_block and post_block become:
> 
>         if (pi_pre_block(vcpu))
>                 return 1;
>         if (preemption timer bit set in VMCS)
>                 kvm_lapic_start_sw_timer(vcpu);
> 
> and
> 
>         if (preemption timer bit set in VMCS)
>                 kvm_lapic_cancel_sw_timer(vcpu);
>         pi_post_block(vcpu);
> 
> There is no need to redo the preemption timer setup in post_block.
> 
> 3) regarding the erratum, you can use the code from
> https://www.virtualbox.org/svn/vbox/trunk/src/VBox/VMM/VMMR0/HMR0.cpp
> (function hmR0InitIntelIsSubjectToVmxPreemptionTimerErratum... please
> rename it to something else! :)).

Thanks for the information.
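
If I read that VirtualBox function correctly, the check boils down to
comparing the host CPU's CPUID family/model/stepping signature against
a list of affected processors. Something with this shape, then (sketch
only; the two signature values below are placeholders, the real list
still has to be copied over from the VirtualBox code):

/* Placeholder signatures -- to be filled in from the VirtualBox list. */
static u32 vmx_broken_preemption_timer_cpus[] = {
        0x000206E6,     /* placeholder */
        0x00020652,     /* placeholder */
};

static bool cpu_has_broken_vmx_preemption_timer(void)
{
        u32 eax = cpuid_eax(0x00000001);
        unsigned int i;

        /*
         * Clear the reserved bits of CPUID.01H:EAX, keeping the
         * family/model/stepping signature.
         */
        eax &= ~(0x3U << 14 | 0xfU << 28);

        for (i = 0; i < ARRAY_SIZE(vmx_broken_preemption_timer_cpus); i++)
                if (eax == vmx_broken_preemption_timer_cpus[i])
                        return true;

        return false;
}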

--jyh

> 
> Thanks,
> 
> Paolo
> 
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index f1cf8a5ede11..93679db7ce0f 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -1313,7 +1313,7 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
> >  		__delay(tsc_deadline - guest_tsc);
> >  }
> >  
> > -static void start_sw_tscdeadline(struct kvm_lapic *apic)
> > +static void start_sw_tscdeadline(struct kvm_lapic *apic, int no_expire)
> >  {
> >  	u64 guest_tsc, tscdeadline = apic->lapic_timer.tscdeadline;
> >  	u64 ns = 0;
> > @@ -1330,7 +1330,7 @@ static void start_sw_tscdeadline(struct kvm_lapic *apic)
> >  	now = apic->lapic_timer.timer.base->get_time();
> >  	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > -	if (likely(tscdeadline > guest_tsc)) {
> > +	if (no_expire || likely(tscdeadline > guest_tsc)) {
> >  		ns = (tscdeadline - guest_tsc) * 1000000ULL;
> >  		do_div(ns, this_tsc_khz);
> >  		expire = ktime_add_ns(now, ns);
> > @@ -1343,6 +1343,85 @@ static void start_sw_tscdeadline(struct kvm_lapic *apic)
> >  	local_irq_restore(flags);
> >  }
> >  
> > +void kvm_lapic_arm_hv_timer(struct kvm_vcpu *vcpu)
> > +{
> > +	struct kvm_lapic *apic = vcpu->arch.apic;
> > +	u64 tscdeadline, guest_tsc;
> > +
> > +	if (apic->lapic_timer.hv_timer_state == HV_TIMER_NOT_USED)
> > +		return;
> > +
> > +	tscdeadline = apic->lapic_timer.tscdeadline;
> > +	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> > +
> > +	if (tscdeadline >= guest_tsc)
> > +		kvm_x86_ops->set_hv_timer(vcpu, tscdeadline - guest_tsc);
> > +	else
> > +		kvm_x86_ops->set_hv_timer(vcpu, 0);
> > +
> > +	apic->lapic_timer.hv_timer_state = HV_TIMER_ARMED;
> > +	trace_kvm_hv_timer_state(vcpu->vcpu_id,
> > +			apic->lapic_timer.hv_timer_state);
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_lapic_arm_hv_timer);
> > +
> > +void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu)
> > +{
> > +	struct kvm_lapic *apic = vcpu->arch.apic;
> > +
> > +	WARN_ON(apic->lapic_timer.hv_timer_state != HV_TIMER_ARMED);
> > +	WARN_ON(swait_active(&vcpu->wq));
> > +	kvm_x86_ops->cancel_hv_timer(vcpu);
> > +	apic->lapic_timer.hv_timer_state = HV_TIMER_NOT_USED;
> > +	apic_timer_expired(apic);
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_lapic_expired_hv_timer);
> > +
> > +void switch_to_hv_lapic_timer(struct kvm_vcpu *vcpu)
> > +{
> > +	struct kvm_lapic *apic = vcpu->arch.apic;
> > +
> > +	WARN_ON(apic->lapic_timer.hv_timer_state != HV_TIMER_NOT_USED);
> > +
> > +	if (apic_lvtt_tscdeadline(apic) &&
> > +	    !atomic_read(&apic->lapic_timer.pending)) {
> > +		hrtimer_cancel(&apic->lapic_timer.timer);
> > +		/* In case the timer triggered in above small window */
> > +		if (!atomic_read(&apic->lapic_timer.pending)) {
> > +			apic->lapic_timer.hv_timer_state =
> > +				HV_TIMER_NEEDS_ARMING;
> > +			trace_kvm_hv_timer_state(vcpu->vcpu_id,
> > +				apic->lapic_timer.hv_timer_state);
> > +		}
> > +	}
> > +}
> > +EXPORT_SYMBOL_GPL(switch_to_hv_lapic_timer);
> > +
> > +void switch_to_sw_lapic_timer(struct kvm_vcpu *vcpu)
> > +{
> > +	struct kvm_lapic *apic = vcpu->arch.apic;
> > +
> > +	/* Possibly the TSC deadline timer is not enabled yet */
> > +	if (apic->lapic_timer.hv_timer_state == HV_TIMER_NOT_USED)
> > +		return;
> > +
> > +	if (apic->lapic_timer.hv_timer_state == HV_TIMER_ARMED)
> > +		kvm_x86_ops->cancel_hv_timer(vcpu);
> > +	apic->lapic_timer.hv_timer_state = HV_TIMER_NOT_USED;
> > +
> > +	if (atomic_read(&apic->lapic_timer.pending))
> > +		return;
> > +
> > +	/*
> > +	 * Don't trigger the apic_timer_expired() for deadlock,
> > +	 * because the swake_up() from apic_timer_expired() will
> > +	 * try to get the run queue lock, which has been held here
> > +	 * since we are in context switch procedure already.
> > +	 */
> > +	start_sw_tscdeadline(apic, 1);
> > +}
> > +EXPORT_SYMBOL_GPL(switch_to_sw_lapic_timer);
> > +
> >  static void start_apic_timer(struct kvm_lapic *apic)
> >  {
> >  	ktime_t now;
> > @@ -1389,7 +1468,19 @@ static void start_apic_timer(struct kvm_lapic *apic)
> >  			   ktime_to_ns(ktime_add_ns(now,
> >  					apic->lapic_timer.period)));
> >  	} else if (apic_lvtt_tscdeadline(apic)) {
> > -		start_sw_tscdeadline(apic);
> > +		/* lapic timer in tsc deadline mode */
> > +		if (kvm_x86_ops->set_hv_timer) {
> > +			if (unlikely(!apic->lapic_timer.tscdeadline ||
> > +			    !apic->vcpu->arch.virtual_tsc_khz))
> > +				return;
> > +
> > +			/* Expired timer will be checked on vcpu_run() */
> > +			apic->lapic_timer.hv_timer_state =
> > +				HV_TIMER_NEEDS_ARMING;
> > +			trace_kvm_hv_timer_state(apic->vcpu->vcpu_id,
> > +				apic->lapic_timer.hv_timer_state);
> > +		} else
> > +			start_sw_tscdeadline(apic, 0);
> >  	}
> >  }
> >  
> > diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> > index 891c6da7d4aa..dc4fd8eea04d 100644
> > --- a/arch/x86/kvm/lapic.h
> > +++ b/arch/x86/kvm/lapic.h
> > @@ -12,6 +12,12 @@
> >  #define KVM_APIC_SHORT_MASK	0xc0000
> >  #define KVM_APIC_DEST_MASK	0x800
> >  
> > +enum {
> > +	HV_TIMER_NOT_USED,
> > +	HV_TIMER_NEEDS_ARMING,
> > +	HV_TIMER_ARMED,
> > +};
> > +
> >  struct kvm_timer {
> >  	struct hrtimer timer;
> >  	s64 period;				/* unit: ns */
> > @@ -20,6 +26,7 @@ struct kvm_timer {
> >  	u64 tscdeadline;
> >  	u64 expired_tscdeadline;
> >  	atomic_t pending;			/* accumulated triggered timers */
> > +	int hv_timer_state;
> >  };
> >  
> >  struct kvm_lapic {
> > @@ -212,4 +219,8 @@ bool kvm_intr_is_single_vcpu_fast(struct kvm *kvm, struct kvm_lapic_irq *irq,
> >  			struct kvm_vcpu **dest_vcpu);
> >  int kvm_vector_to_index(u32 vector, u32 dest_vcpus,
> >  			const unsigned long *bitmap, u32 bitmap_size);
> > +void switch_to_sw_lapic_timer(struct kvm_vcpu *vcpu);
> > +void switch_to_hv_lapic_timer(struct kvm_vcpu *vcpu);
> > +void kvm_lapic_arm_hv_timer(struct kvm_vcpu *vcpu);
> > +void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu);
> >  #endif
> > diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> > index 8de925031b5c..8e1b5e9c2e78 100644
> > --- a/arch/x86/kvm/trace.h
> > +++ b/arch/x86/kvm/trace.h
> > @@ -6,6 +6,7 @@
> >  #include <asm/svm.h>
> >  #include <asm/clocksource.h>
> >  #include <asm/pvclock-abi.h>
> > +#include <lapic.h>
> >  
> >  #undef TRACE_SYSTEM
> >  #define TRACE_SYSTEM kvm
> > @@ -1348,6 +1349,27 @@ TRACE_EVENT(kvm_avic_unaccelerated_access,
> >  		  __entry->vec)
> >  );
> >  
> > +#define kvm_trace_symbol_hv_timer		\
> > +	{HV_TIMER_NOT_USED, "no"},		\
> > +	{HV_TIMER_NEEDS_ARMING, "need_arming"},	\
> > +	{HV_TIMER_ARMED, "armed"}
> > +
> > +TRACE_EVENT(kvm_hv_timer_state,
> > +	TP_PROTO(unsigned int vcpu_id, unsigned int hv_timer_state),
> > +	TP_ARGS(vcpu_id, hv_timer_state),
> > +	TP_STRUCT__entry(
> > +		__field(unsigned int, vcpu_id)
> > +		__field(unsigned int, hv_timer_state)
> > +		),
> > +	TP_fast_assign(
> > +		__entry->vcpu_id = vcpu_id;
> > +		__entry->hv_timer_state = hv_timer_state;
> > +		),
> > +	TP_printk("vcpu_id %x hv_timer %s\n",
> > +		__entry->vcpu_id,
> > +		__print_symbolic(__entry->hv_timer_state,
> > +			kvm_trace_symbol_hv_timer))
> > +	);
> >  #endif /* _TRACE_KVM_H */
> >  
> >  #undef TRACE_INCLUDE_PATH
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 2b29afa61715..dc0b8cae02b8 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -7576,6 +7576,11 @@ static int handle_pcommit(struct kvm_vcpu *vcpu)
> >  	return 1;
> >  }
> >  
> > +static int handle_preemption_timer(struct kvm_vcpu *vcpu)
> > +{
> > +	kvm_lapic_expired_hv_timer(vcpu);
> > +	return 1;
> > +}
> >  /*
> >   * The exit handlers return 1 if the exit was handled fully and guest execution
> >   * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
> > @@ -7627,6 +7632,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
> >  	[EXIT_REASON_XRSTORS]                 = handle_xrstors,
> >  	[EXIT_REASON_PML_FULL]                = handle_pml_full,
> >  	[EXIT_REASON_PCOMMIT]                 = handle_pcommit,
> > +	[EXIT_REASON_PREEMPTION_TIMER]        = handle_preemption_timer,
> >  };
> >  
> >  static const int kvm_vmx_max_exit_handlers =
> > @@ -8678,6 +8684,8 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
> >  	if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
> >  		vmx_set_interrupt_shadow(vcpu, 0);
> >  
> > +	kvm_lapic_arm_hv_timer(vcpu);
> > +
> >  	if (vmx->guest_pkru_valid)
> >  		__write_pkru(vmx->guest_pkru);
> >  
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 269d576520ba..f4b50608568a 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -7724,11 +7724,15 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
> >  
> >  void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
> >  {
> > +	if (kvm_x86_ops->set_hv_timer)
> > +		switch_to_hv_lapic_timer(vcpu);
> >  	kvm_x86_ops->sched_in(vcpu, cpu);
> >  }
> >  
> >  void kvm_arch_sched_out(struct kvm_vcpu *vcpu)
> >  {
> > +	if (kvm_x86_ops->set_hv_timer)
> > +		switch_to_sw_lapic_timer(vcpu);
> >  }
> >  
> >  int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> > 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html