Re: [PATCH 2/7] KVM: lapic: Delay 1ns at a time when waiting for timer to "expire"

Liran Alon <liran.alon@xxxxxxxxxx> · Sun, 14 Apr 2019 14:25:44 +0300

> On 12 Apr 2019, at 23:18, Sean Christopherson <sean.j.christopherson@xxxxxxxxx> wrote:
> 
> To minimize the latency of timer interrupts as observed by the guest,
> KVM adjusts the values it programs into the host timers to account for
> the host's overhead of programming and handling the timer event.  In
> the event that the adjustments are too aggressive, i.e. the timer fires
> earlier than the guest expects, KVM busy waits immediately prior to
> entering the guest.
> 
> Currently, KVM manually converts the delay from nanoseconds to clock
> cycles.  But, the conversion is done in the guest's time domain, while
> the delay occurs in the host's time domain, i.e. the delay may not be
> accurate and could wait too little or too long.  

Just to make sure I understand the issue you see here:
The __delay() is called with a parameter which the number of guest cycles required
to get from current guest cycle (guest_tsc) to deadline cycle (tsc_deadline).
But in case guest vCPU frequency is different than physical CPU frequency,
this parameter should further be converted from guest cycles to host cycles.
Do I understand correctly? If yes, I think you should elaborate this part a bit more in commit message.
As this is quite confusing. :)

> Sidestep the headache
> of shifting time domains by delaying 1ns at a time and breaking the loop
> when the guest's desired timer delay has been satisfied.  Because the
> advancement, which caps the delay to avoid unbounded busy waiting, is
> stored in nanoseconds, the current advancement time can simply be used
> as a loop terminator since we're delaying 1ns at a time (plus the few
> cycles of overhead for running the code).

Looking at start_sw_tscdeadline(), we have the same issue where we require to convert guest cycles to host ns time in order to set hrtimer expiration time.
This is simply done by:
host_ns = guest_cycles * 1000000ULL;
do_div(host_ns, vcpu->arch.virtual_tsc_khz);

Why shouldn’t we use the same approach for constructing __delay() parameter from guest_cycles currently pass to it as parameter?
i.e. something such as:
delay_guest_cycles = min(tsc_deadline - guest_tsc, nsec_to_cycles(vcpu, lapic_timer_advance_ns));
delay_host_ns = delay_guest_cycles * 1000000ULL;
do_div(delay_host_ns, vcpu->arch.virtual_tsc_khz);
__delay(delay_host_ns);

-Liran

> 
> Cc: Liran Alon <liran.alon@xxxxxxxxxx>
> Cc: Wanpeng Li <wanpengli@xxxxxxxxxxx>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> ---
> arch/x86/kvm/lapic.c | 13 +++++++------
> 1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 92446cba9b24..e797e3145a8b 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1486,7 +1486,8 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
> void wait_lapic_expire(struct kvm_vcpu *vcpu)
> {
> 	struct kvm_lapic *apic = vcpu->arch.apic;
> -	u64 guest_tsc, tsc_deadline, ns;
> +	u32 timer_advance_ns = lapic_timer_advance_ns;
> +	u64 guest_tsc, tmp_tsc, tsc_deadline, ns;

Nit: I would rename guest_tsc and tmp_tsc to guest_tsc_on_exit and current_guest_tsc respectively.

> 
> 	if (!lapic_in_kernel(vcpu))
> 		return;
> @@ -1499,13 +1500,13 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
> 
> 	tsc_deadline = apic->lapic_timer.expired_tscdeadline;
> 	apic->lapic_timer.expired_tscdeadline = 0;
> -	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> +	tmp_tsc = guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> 	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
> 
> -	/* __delay is delay_tsc whenever the hardware has TSC, thus always.  */
> -	if (guest_tsc < tsc_deadline)
> -		__delay(min(tsc_deadline - guest_tsc,
> -			nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
> +	for (ns = 0; tmp_tsc < tsc_deadline && ns < timer_advance_ns; ns++) {
> +		ndelay(1);
> +		tmp_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> +	}
> 
> 	if (!lapic_timer_advance_adjust_done) {
> 		/* too early */
> -- 
> 2.21.0
>