Re: [RFC PATCH V2 4/4] Utilize the vmx preemption timer for tsc deadline timer

yunhong jiang <yunhong.jiang@xxxxxxxxxxxxxxx> · Fri, 3 Jun 2016 17:24:50 -0700

On Wed, 25 May 2016 13:52:56 +0200
Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:

> 
> 
> On 25/05/2016 00:27, Yunhong Jiang wrote:
> > From: Yunhong Jiang <yunhong.jiang@xxxxxxxxx>
> > 
> > Utilizing the VMX preemption timer for tsc deadline timer
> > virtualization. The VMX preemption timer is armed when the vCPU is
> > running, and a VMExit will happen if the virtual TSC deadline timer
> > expires.
> > 
> > When the vCPU thread is scheduled out, the tsc deadline timer
> > virtualization will be switched to use the current solution, i.e.
> > use the timer for it. It's switched back to VMX preemption timer
> > when the vCPU thread is scheduled int.
> > 
> > This solution avoids the complex OS's hrtimer system, and also the
> > host timer interrupt handling cost, with a preemption_timer VMexit.
> > It fits well for some NFV usage scenario, when the vCPU is bound to
> > a pCPU and the pCPU is isolated, or some similar scenario.
> > 
> > However, it possibly has impact if the vCPU thread is scheduled
> > in/out very frequently, because it switches from/to the hrtimer
> > emulation a lot.
> > 
> > Signed-off-by: Yunhong Jiang <yunhong.jiang@xxxxxxxxx>
> > ---
> >  arch/x86/kvm/lapic.c | 97
> > ++++++++++++++++++++++++++++++++++++++++++++++++++--
> > arch/x86/kvm/lapic.h | 11 ++++++ arch/x86/kvm/trace.h | 22
> > ++++++++++++ arch/x86/kvm/vmx.c   |  8 +++++
> >  arch/x86/kvm/x86.c   |  4 +++
> >  5 files changed, 139 insertions(+), 3 deletions(-)
> 
> Thanks, this looks very nice.
> 
> As I mentioned in my reply to Marcelo, I have some more changes in
> mind to reduce the overhead.  This way we can enable it
> unconditionally (on processors that do not have the erratum).   In
> particular:
> 
> 1) While a virtual machine is scheduled out due to contention, you
> don't care if the timer fires.  You can fire the timer immediately
> after the next vmentry.  You only need to set a timer during HLT.
> So, rather than sched_out/sched_in, you can start and stop the hrtimer
> in vmx_pre_block and vmx_post_block.  You probably should rename what
> is now vmx_pre_block and vmx_post_block to something like
> pi_pre_block and pi_post_block, so that vmx_pre_block and
> vmx_post_block are like
> 
> 	if (pi_pre_block(vcpu))
> 		return 1;
> 	if (have preemption timer) {
> 		kvm_lapic_switch_to_sw_timer(vcpu);
> 
> and
> 
> 	if (have preemption timer)
> 		kvm_lapic_switch_to_hv_timer(vcpu);
> 	pi_post_block(vcpu);
> 
> respectively.
> 
> You'll also need to check whether the deadlock in
> switch_to_sw_lapic_timer is still possible.  Hopefully not---if so,
> simpler code! :)
> 
> 2) if possible, I would like to remove kvm_lapic_arm_hv_timer.
> Instead, make switch_to_hv_lapic_timer and start_apic_timer can call
> kvm_x86_ops->set_hv_timer, passing the desired *guest* TSC.
> set_hv_timer can convert the guest TSC to a host TSC and save the
> desired host TSC in struct vmx_vcpu.  vcpu_run checks if there is a

Hi, Paolo, I mostly finished my patch, but one thing come to my mind in
the last minutes and hope to get your input.
One issue to save the host TSC on the vmx_vcpu is, if the vCPU thread
is migrated to another pCPU, and the TSC are not synchronized between
these two pCPUs, then the result is not accurate. In previous patchset,
it's ok because we do the calculation on the vcpu_run() thus the
migration does not matter.

This also bring a tricky situation considering the VMX preepmtion timer
can only hold 32bit value. In a rare situation, the host delta TSC is
less than 32 bit, however, if there is a host tsc backwards after the
migration, then the delta TSC on the new CPU may be larger than 32 bit,
but we can't switch back to sw timer anymore because it's too late. I
add some hacky code on the kvm_arch_vcpu_load(), not sure if it's ok.

I will send a RFC patch. I'm still looking for a
platform with TSC scaling support, thus not test TSC scaling
yet. Others has been tested.

> TSC deadline and, if so, writes the delta into
> VMX_PREEMPTION_TIMER_VALUE. set_hv_timer/cancel_hv_timer only write
> to PIN_BASED_VM_EXEC_CONTROL.
> 
> Besides making vcpu_run faster, I think that with this change the
> distinction between HV_TIMER_NEEDS_ARMING and HV_TIMER_ARMED
> disappears, and struct kvm_lapic can simply have a bool
> hv_timer_in_use.  Again, it might actually end up making the code
> simpler, especially the interface between lapic.c and vmx.c.
> 
> The hardest part is converting guest_tsc to host_tsc.  From
> 
> 	guest_tsc = (host_tsc * vcpu->arch.tsc_scaling_ratio >>
> 			kvm_tsc_scaling_ratio_frac_bits) + tsc_offset
> 
> you have
> 
> 	host_tsc = ((unsigned __int128)(guest_tsc - tsc_offset)
> 			<< kvm_tsc_scaling_ratio_frac_bits)
> 			/ vcpu->arch.tsc_scaling_ratio;
> 
> ... except that you need a division with a 128-bit dividend and a
> 64-bit divisor here, and there is no such thing in Linux.  It's okay
> if you restrict this to 64-bit hosts and use the divq assembly
> instruction directly.  (You also need to trap the divide overflow
> exception; if there is an overflow just disable the preemption

I simply check if the calculation will cause overflow. A divide error
exception is too much for the target usage scenario.

Thanks
--jyh
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html