Wanpeng pointed out that auto-tuning the advancement time for every vCPU can lead to inaccurate wait times, e.g. if the advancement is calculated while an overcommitted host is under heavy load, then KVM will waste a lot of time busy waiting if the load decreases, e.g. when neighbour VMs are idle. Sidestep this issue and reduce time spent adjusting the advancement by saving the first stable advancement value and reusing that value for all *new* vCPUs. This provides a safe way to auto-adjust the advancement, minimizes the potential for a poor calculation due to system load, and preserves the ability for userspace to change the advancement on the fly (the module parameter is writable when KVM is loaded), as the value set by userspace takes priority. Regarding changing the advancement on the fly, doing so is likely less useful with auto-adjusting, especially now that recognizing a change requires restarting the VM. Allowing the two concepts to coexist is theoretically possible, but would be ugly and slow. Auto-tuning needs to track advancement time on a per-vCPU basis (because adjusments are done relative to the last advancement), so supporting both approaches would require additional code and conditionals, i.e. overhead, but would only provide marginal value. That being said, keep the ability to change the module param without a reload as it's useful for debug and testing. Note, the comment from commit 39497d7660d9 ("KVM: lapic: Track lapic timer advance per vCPU") about TSC scaling: And because virtual_tsc_khz and tsc_scaling_ratio are per-vCPU, the correct calibration for a given vCPU may not apply to all vCPUs. was effectively resolved by commit b6aa57c69cb2 ("KVM: lapic: Convert guest TSC to host time domain if necessary"). The timer advancement is calculated and stored in nanoseconds, and operates in the host time domain, i.e. calculates *host* overhead. In short, reusing an advancement time for vCPUs with different TSC scaling is ok. Fixes: 39497d7660d9 ("KVM: lapic: Track lapic timer advance per vCPU") Cc: Wanpeng Li <kernellwp@xxxxxxxxx> Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx> --- arch/x86/kvm/lapic.c | 22 ++++++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index bd13fdddbdc4..4f234df11078 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -70,6 +70,8 @@ #define APIC_BROADCAST 0xFF #define X2APIC_BROADCAST 0xFFFFFFFFul +static u32 adjusted_timer_advance_ns = -1; + #define LAPIC_TIMER_ADVANCE_ADJUST_DONE 100 /* step-by-step approximation to mitigate fluctuation */ #define LAPIC_TIMER_ADVANCE_ADJUST_STEP 8 @@ -1542,6 +1544,15 @@ void wait_lapic_expire(struct kvm_vcpu *vcpu) apic->lapic_timer.timer_advance_adjust_done = true; } apic->lapic_timer.timer_advance_ns = timer_advance_ns; + + /* + * The first vCPU to get a stable advancement time "wins" and + * sets the advancement time that is used for *new* vCPUS that + * are created with auto-adjusting enabled. + */ + if (apic->lapic_timer.timer_advance_adjust_done) + (void)cmpxchg(&adjusted_timer_advance_ns, -1, + timer_advance_ns); } } @@ -2302,12 +2313,15 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns) hrtimer_init(&apic->lapic_timer.timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED); apic->lapic_timer.timer.function = apic_timer_fn; - if (timer_advance_ns == -1) { + if (timer_advance_ns != -1) { + apic->lapic_timer.timer_advance_ns = timer_advance_ns; + apic->lapic_timer.timer_advance_adjust_done = true; + } else if (adjusted_timer_advance_ns != -1) { + apic->lapic_timer.timer_advance_ns = adjusted_timer_advance_ns; + apic->lapic_timer.timer_advance_adjust_done = true; + } else { apic->lapic_timer.timer_advance_ns = 1000; apic->lapic_timer.timer_advance_adjust_done = false; - } else { - apic->lapic_timer.timer_advance_ns = timer_advance_ns; - apic->lapic_timer.timer_advance_adjust_done = true; } -- 2.21.0