> On 15 Apr 2019, at 19:11, Sean Christopherson <sean.j.christopherson@xxxxxxxxx> wrote: > > On Sun, Apr 14, 2019 at 02:25:44PM +0300, Liran Alon wrote: >> >> >>> On 12 Apr 2019, at 23:18, Sean Christopherson <sean.j.christopherson@xxxxxxxxx> wrote: >>> >>> To minimize the latency of timer interrupts as observed by the guest, >>> KVM adjusts the values it programs into the host timers to account for >>> the host's overhead of programming and handling the timer event. In >>> the event that the adjustments are too aggressive, i.e. the timer fires >>> earlier than the guest expects, KVM busy waits immediately prior to >>> entering the guest. >>> >>> Currently, KVM manually converts the delay from nanoseconds to clock >>> cycles. But, the conversion is done in the guest's time domain, while >>> the delay occurs in the host's time domain, i.e. the delay may not be >>> accurate and could wait too little or too long. >> >> Just to make sure I understand the issue you see here: >> The __delay() is called with a parameter which the number of guest cycles required >> to get from current guest cycle (guest_tsc) to deadline cycle (tsc_deadline). >> But in case guest vCPU frequency is different than physical CPU frequency, >> this parameter should further be converted from guest cycles to host cycles. >> Do I understand correctly? If yes, I think you should elaborate this part a bit more in commit message. >> As this is quite confusing. :) > > Heh, I didn't elaborate precisely because it was so darn confusing. I > didn't want to say something completely incorrect so I kept it high level. Heh, I can agree with that :) But we should make sure others who read commit message won’t get confuse. > >> >>> Sidestep the headache >>> of shifting time domains by delaying 1ns at a time and breaking the loop >>> when the guest's desired timer delay has been satisfied. Because the >>> advancement, which caps the delay to avoid unbounded busy waiting, is >>> stored in nanoseconds, the current advancement time can simply be used >>> as a loop terminator since we're delaying 1ns at a time (plus the few >>> cycles of overhead for running the code). >> >> Looking at start_sw_tscdeadline(), we have the same issue where we require to convert guest cycles to host ns time in order to set hrtimer expiration time. >> This is simply done by: >> host_ns = guest_cycles * 1000000ULL; >> do_div(host_ns, vcpu->arch.virtual_tsc_khz); >> >> Why shouldn’t we use the same approach for constructing __delay() parameter from guest_cycles currently pass to it as parameter? >> i.e. something such as: >> delay_guest_cycles = min(tsc_deadline - guest_tsc, nsec_to_cycles(vcpu, lapic_timer_advance_ns)); >> delay_host_ns = delay_guest_cycles * 1000000ULL; >> do_div(delay_host_ns, vcpu->arch.virtual_tsc_khz); >> __delay(delay_host_ns); > > That's also an option. I opted for the loop as it felt simpler and more > obvious. And explicitly checking rechecking kvm_read_l1_tsc() guarantees > the guest won't see an "early" TSC unless the cap kicks in. That being > said, I don't have a strong opinion either way. > > As for the code, we'd probably want to do the conversion first and cap > second so as to avoid converting lapic_timer_advance_ns to the guest's > TSC and then back to ns. It should use ndelay(), and maybe just call it > "delay_ns" to avoid (temporarily) assigning a guest TSC calculation to a > host variable, e.g.: > > delay_ns = (tsc_deadline - guest_tsc) * 1000000ULL; > do_div(delay_ns, vcpu->arch.virtual_tsc_khz); > ndelay(min(delay_ns, lapic_timer_advance_ns)); > > That's actually nowhere near as as bad as I thought it would be now that > it's typed out… Yes I agree this code looks nice. I think it’s better. I recommend sending a v2 with this version. > > >> >> -Liran >> >>> >>> Cc: Liran Alon <liran.alon@xxxxxxxxxx> >>> Cc: Wanpeng Li <wanpengli@xxxxxxxxxxx> >>> Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx> >>> --- >>> arch/x86/kvm/lapic.c | 13 +++++++------ >>> 1 file changed, 7 insertions(+), 6 deletions(-) >>> >>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c >>> index 92446cba9b24..e797e3145a8b 100644 >>> --- a/arch/x86/kvm/lapic.c >>> +++ b/arch/x86/kvm/lapic.c >>> @@ -1486,7 +1486,8 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu) >>> void wait_lapic_expire(struct kvm_vcpu *vcpu) >>> { >>> struct kvm_lapic *apic = vcpu->arch.apic; >>> - u64 guest_tsc, tsc_deadline, ns; >>> + u32 timer_advance_ns = lapic_timer_advance_ns; >>> + u64 guest_tsc, tmp_tsc, tsc_deadline, ns; >> >> Nit: I would rename guest_tsc and tmp_tsc to guest_tsc_on_exit and current_guest_tsc respectively. > > guest_tsc_on_exit isn't correct though, as it holds the guest's TSC at the > beginning of wait_lapic_expire(), e.g. guest_tsc_start would be more > appropriate. My main motivation for not using verbose names is keep lines > short, i.e. avoid wrapping at 80 chars. For me, wrapping a bunch of lines > makes the code more difficult to read than less verbose names. I think it doesn’t matter now if we agree with the above mentioned suggestion of avoiding this loop. :) -Liran > >> >>> >>> if (!lapic_in_kernel(vcpu)) >>> return; >>> @@ -1499,13 +1500,13 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu) >>> >>> tsc_deadline = apic->lapic_timer.expired_tscdeadline; >>> apic->lapic_timer.expired_tscdeadline = 0; >>> - guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc()); >>> + tmp_tsc = guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc()); >>> trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline); >>> >>> - /* __delay is delay_tsc whenever the hardware has TSC, thus always. */ >>> - if (guest_tsc < tsc_deadline) >>> - __delay(min(tsc_deadline - guest_tsc, >>> - nsec_to_cycles(vcpu, lapic_timer_advance_ns))); >>> + for (ns = 0; tmp_tsc < tsc_deadline && ns < timer_advance_ns; ns++) { >>> + ndelay(1); >>> + tmp_tsc = kvm_read_l1_tsc(vcpu, rdtsc()); >>> + } >>> >>> if (!lapic_timer_advance_adjust_done) { >>> /* too early */ >>> -- >>> 2.21.0 >>> >>