Re: KVM: x86: __wait_lapic_expire silently using TPAUSE C0.2

On Tue, Sep 10, 2024, Jon Kohler wrote:
> How about something like this for the regular ol’ PAUSE route?
> 
> Note: the ndelay side would likely be a bit more annoying to internalize into
> KVM, but perhaps we could just have the delay library return the number of
> cycles, and then do the loop I’ve got as a separate, KVM-only func?
> 
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1627,16 +1627,39 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
>  static inline void __wait_lapic_expire(struct kvm_vcpu *vcpu, u64 guest_cycles)
>  {
>         u64 timer_advance_ns = vcpu->arch.apic->lapic_timer.timer_advance_ns;
> +       u64 start, end, delay_by = 0;
>  
>         /*
>          * If the guest TSC is running at a different ratio than the host, then
> -        * convert the delay to nanoseconds to achieve an accurate delay.  Note
> -        * that __delay() uses delay_tsc whenever the hardware has TSC, thus
> -        * always for VMX enabled hardware.
> +        * convert the delay to nanoseconds to achieve an accurate delay.
> +        *
> +        * Note: open code delay function as KVM's use case is a bit special, as
> +        * we know we need to reenter the guest at a specific time; however, the
> +        * delay library may introduce architectural delays that we do not want,
> +        * such as using TPAUSE. Our mission is to simply get into the guest as
> +        * soon as possible without violating architectural constraints.
> +        * RFC: Keep ndelay for help converting to nsec? or pull that in too?
>          */
>         if (vcpu->arch.tsc_scaling_ratio == kvm_caps.default_tsc_scaling_ratio) {
> -               __delay(min(guest_cycles,
> -                       nsec_to_cycles(vcpu, timer_advance_ns)));
> +               delay_by = min(guest_cycles,
> +                              nsec_to_cycles(vcpu, timer_advance_ns));
> +
> +               if (delay_by == 0) {
> +                       return;
> +               } else {
> +                       start = rdtsc();
> +
> +                       for (;;) {
> +                               cpu_relax();
> +                               end = rdtsc();
> +
> +                               if (delay_by <= end - start)
> +                                       break;
> +
> +                               delay_by -= end - start;
> +                               start = end;
> +                       }

I don't want to replicate __delay() in KVM.  If the delay is short, KVM should
busy wait and intentionally NOT do PAUSE, i.e. not yield to the SMT sibling(s).
Because unlike most uses of PAUSE, this CPU isn't waiting on any resource except
time itself.  E.g. there's no need to give a different CPU cycles so that the
other CPU can finish a critical section.

And again, entering the guest so that the vCPU can service the IRQ is a priority,
e.g. see all of the ChromeOS work around paravirt scheduling to boost the priority
of vCPUs that have pending IRQs.  Not to mention the physical CPU is running with
IRQs disabled, i.e. if a _host_ IRQ becomes pending, then entering the guest is a
priority in order to recognize that IRQ as well.

> +               }
>         } else {
>                 u64 delay_ns = guest_cycles * 1000000ULL;
>                 do_div(delay_ns, vcpu->arch.virtual_tsc_khz);

Whatever we do, we should do the same thing irrespective of whether or not TSC
scaling is in use.  I don't think we have an existing helper to unscale a guest
TSC, but it shouldn't be hard to add (do math).

E.g. something like:

	delay_cycles = kvm_unscale_tsc(guest_cycles,
				       vcpu->arch.tsc_scaling_ratio);
	delay_cycles = min(delay_cycles, nsec_to_cycles(vcpu, timer_advance_ns));

	if (delay_cycles > N) {
		__delay(delay_cycles);
		return;
	}

	for (start = rdtsc(); rdtsc() - start < delay_cycles;)
		;
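
For reference, a rough sketch of what that unscale helper could look like
(illustrative only; the name kvm_unscale_tsc() is made up here, and the body
assumes inverting kvm_scale_tsc()'s fixed-point math via mul_u64_u64_div_u64()
and kvm_caps.tsc_scaling_ratio_frac_bits):

	/*
	 * Illustrative sketch: convert guest TSC cycles back to host cycles by
	 * inverting the scaling done by kvm_scale_tsc(), i.e.
	 * host = (guest << frac_bits) / ratio.
	 */
	static u64 kvm_unscale_tsc(u64 guest_cycles, u64 ratio)
	{
		if (ratio == kvm_caps.default_tsc_scaling_ratio)
			return guest_cycles;

		return mul_u64_u64_div_u64(guest_cycles,
					   1ULL << kvm_caps.tsc_scaling_ratio_frac_bits,
					   ratio);
	}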

And then I also think we (handwavy "we") should do some profiling to ensure KVM
isn't regularly waiting for more than N cycles, where N is probably something
like 200.  Because if KVM is regularly waiting for 200+ cycles, then we likely
need to re-tune the timer_advance_ns logic to be more conservative, i.e. err more
on the side of the timer arriving slightly late so as not to waste CPU cycles.
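
If it helps, a rough sketch of how that profiling could be wired up (purely
illustrative; the 200-cycle threshold and the trace_printk() reporting are just
placeholders, not a proposal for the final patch):

	/*
	 * Illustrative only: time the busy-wait and flag waits that blow past
	 * ~200 cycles, i.e. candidates for re-tuning timer_advance_ns.
	 */
	u64 start = rdtsc(), waited;

	while (rdtsc() - start < delay_cycles)
		;

	waited = rdtsc() - start;
	if (waited > 200)
		trace_printk("lapic timer advance waited %llu cycles\n", waited);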




