Re: [PATCH 2/7] KVM: lapic: Delay 1ns at a time when waiting for timer to "expire"

Liran Alon <liran.alon@xxxxxxxxxx> · Mon, 15 Apr 2019 20:06:48 +0300

> On 15 Apr 2019, at 19:11, Sean Christopherson <sean.j.christopherson@xxxxxxxxx> wrote:
> 
> On Sun, Apr 14, 2019 at 02:25:44PM +0300, Liran Alon wrote:
>> 
>> 
>>> On 12 Apr 2019, at 23:18, Sean Christopherson <sean.j.christopherson@xxxxxxxxx> wrote:
>>> 
>>> To minimize the latency of timer interrupts as observed by the guest,
>>> KVM adjusts the values it programs into the host timers to account for
>>> the host's overhead of programming and handling the timer event.  In
>>> the event that the adjustments are too aggressive, i.e. the timer fires
>>> earlier than the guest expects, KVM busy waits immediately prior to
>>> entering the guest.
>>> 
>>> Currently, KVM manually converts the delay from nanoseconds to clock
>>> cycles.  But, the conversion is done in the guest's time domain, while
>>> the delay occurs in the host's time domain, i.e. the delay may not be
>>> accurate and could wait too little or too long.  
>> 
>> Just to make sure I understand the issue you see here:
>> The __delay() is called with a parameter which the number of guest cycles required
>> to get from current guest cycle (guest_tsc) to deadline cycle (tsc_deadline).
>> But in case guest vCPU frequency is different than physical CPU frequency,
>> this parameter should further be converted from guest cycles to host cycles.
>> Do I understand correctly? If yes, I think you should elaborate this part a bit more in commit message.
>> As this is quite confusing. :)
> 
> Heh, I didn't elaborate precisely because it was so darn confusing.  I
> didn't want to say something completely incorrect so I kept it high level.

Heh, I can agree with that :)
But we should make sure others who read commit message won’t get confuse.

> 
>> 
>>> Sidestep the headache
>>> of shifting time domains by delaying 1ns at a time and breaking the loop
>>> when the guest's desired timer delay has been satisfied.  Because the
>>> advancement, which caps the delay to avoid unbounded busy waiting, is
>>> stored in nanoseconds, the current advancement time can simply be used
>>> as a loop terminator since we're delaying 1ns at a time (plus the few
>>> cycles of overhead for running the code).
>> 
>> Looking at start_sw_tscdeadline(), we have the same issue where we require to convert guest cycles to host ns time in order to set hrtimer expiration time.
>> This is simply done by:
>> host_ns = guest_cycles * 1000000ULL;
>> do_div(host_ns, vcpu->arch.virtual_tsc_khz);
>> 
>> Why shouldn’t we use the same approach for constructing __delay() parameter from guest_cycles currently pass to it as parameter?
>> i.e. something such as:
>> delay_guest_cycles = min(tsc_deadline - guest_tsc, nsec_to_cycles(vcpu, lapic_timer_advance_ns));
>> delay_host_ns = delay_guest_cycles * 1000000ULL;
>> do_div(delay_host_ns, vcpu->arch.virtual_tsc_khz);
>> __delay(delay_host_ns);
> 
> That's also an option.  I opted for the loop as it felt simpler and more
> obvious.  And explicitly checking rechecking kvm_read_l1_tsc() guarantees
> the guest won't see an "early" TSC unless the cap kicks in.  That being
> said, I don't have a strong opinion either way.
> 
> As for the code, we'd probably want to do the conversion first and cap
> second so as to avoid converting lapic_timer_advance_ns to the guest's
> TSC and then back to ns.  It should use ndelay(), and maybe just call it
> "delay_ns" to avoid (temporarily) assigning a guest TSC calculation to a
> host variable, e.g.:
> 
>  delay_ns = (tsc_deadline - guest_tsc) * 1000000ULL;
>  do_div(delay_ns, vcpu->arch.virtual_tsc_khz);
>  ndelay(min(delay_ns, lapic_timer_advance_ns));
> 
> That's actually nowhere near as as bad as I thought it would be now that
> it's typed out…

Yes I agree this code looks nice. I think it’s better. I recommend sending a v2 with this version.

> 
> 
>> 
>> -Liran
>> 
>>> 
>>> Cc: Liran Alon <liran.alon@xxxxxxxxxx>
>>> Cc: Wanpeng Li <wanpengli@xxxxxxxxxxx>
>>> Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
>>> ---
>>> arch/x86/kvm/lapic.c | 13 +++++++------
>>> 1 file changed, 7 insertions(+), 6 deletions(-)
>>> 
>>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>>> index 92446cba9b24..e797e3145a8b 100644
>>> --- a/arch/x86/kvm/lapic.c
>>> +++ b/arch/x86/kvm/lapic.c
>>> @@ -1486,7 +1486,8 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
>>> void wait_lapic_expire(struct kvm_vcpu *vcpu)
>>> {
>>> 	struct kvm_lapic *apic = vcpu->arch.apic;
>>> -	u64 guest_tsc, tsc_deadline, ns;
>>> +	u32 timer_advance_ns = lapic_timer_advance_ns;
>>> +	u64 guest_tsc, tmp_tsc, tsc_deadline, ns;
>> 
>> Nit: I would rename guest_tsc and tmp_tsc to guest_tsc_on_exit and current_guest_tsc respectively.
> 
> guest_tsc_on_exit isn't correct though, as it holds the guest's TSC at the
> beginning of wait_lapic_expire(), e.g. guest_tsc_start would be more
> appropriate.  My main motivation for not using verbose names is keep lines
> short, i.e. avoid wrapping at 80 chars.  For me, wrapping a bunch of lines
> makes the code more difficult to read than less verbose names.

I think it doesn’t matter now if we agree with the above mentioned suggestion of avoiding this loop. :)

-Liran

> 
>> 
>>> 
>>> 	if (!lapic_in_kernel(vcpu))
>>> 		return;
>>> @@ -1499,13 +1500,13 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu)
>>> 
>>> 	tsc_deadline = apic->lapic_timer.expired_tscdeadline;
>>> 	apic->lapic_timer.expired_tscdeadline = 0;
>>> -	guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>>> +	tmp_tsc = guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>>> 	trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
>>> 
>>> -	/* __delay is delay_tsc whenever the hardware has TSC, thus always.  */
>>> -	if (guest_tsc < tsc_deadline)
>>> -		__delay(min(tsc_deadline - guest_tsc,
>>> -			nsec_to_cycles(vcpu, lapic_timer_advance_ns)));
>>> +	for (ns = 0; tmp_tsc < tsc_deadline && ns < timer_advance_ns; ns++) {
>>> +		ndelay(1);
>>> +		tmp_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>>> +	}
>>> 
>>> 	if (!lapic_timer_advance_adjust_done) {
>>> 		/* too early */
>>> -- 
>>> 2.21.0
>>> 
>>