Re: [PATCH 12/14] KVM: retpolines: x86: eliminate retpoline from vmx.c exit handlers

Paolo Bonzini <pbonzini@xxxxxxxxxx> · Wed, 16 Oct 2019 19:01:25 +0200

On 16/10/19 18:50, Andrea Arcangeli wrote:
>> It still doesn't add up.  0.3ms / 5 is 1/15000th of a second; 43us is
>> 1/25000th of a second.  Do you have multiple vCPUs perhaps?
> 
> Why would I run any test on UP guests? Rather than spending time doing
> the math on my results, it's probably quicker if you run it yourself:

I don't know, but if you don't say how many vCPUs you have, I cannot do
the math and review the patch.

>> The number of vmexits doesn't count (for HLT).  What counts is how long
>> they take to be serviced, and as long as it's 1us or more the
>> optimization is pointless.
>
> Please note the single_task_running() check which immediately breaks
> the kvm_vcpu_check_block() loop if there's even a single other task
> that can be scheduled in the runqueue of the host CPU.
> 
> What happens when the host is not idle is shown below:
> 
>          w/o optimization                   with optimization
>          ----------------------             -------------------------
> 0us      vmexit                             vmexit
> 500ns    retpoline                          call vmexit handler directly
> 600ns    retpoline                          kvm_vcpu_check_block()
> 700ns    retpoline                          schedule()
> 800ns    kvm_vcpu_check_block()
> 900ns    schedule()
> ...
> 
> Disclaimer: the numbers on the left are arbitrary and I just cut and
> pasted them from yours, no idea how far off they are.

Yes, of course.  But the idea is the same: because of the retpoline
you run the guest for perhaps 300ns more before schedule()ing, but does
that really matter?  300ns * 20000 times/second is a 0.6% performance
impact, and 300ns is already very generous.  I am not sure it would be
measurable at all.
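
To make the arithmetic explicit, here is a minimal sketch of the estimate.
The function name is hypothetical and this is plain userspace C, not kernel
code.  The 300ns delay and the 20000 exits/second are the figures used
above; the 200ns case is simply the difference between the two columns of
the quoted table (900ns vs. 700ns), whose numbers are admittedly arbitrary.

```c
#include <assert.h>

/*
 * Hypothetical helper: fraction of one CPU-second spent on the extra
 * delay, given the per-exit delay in nanoseconds and the exit rate.
 */
static double retpoline_overhead_fraction(double extra_ns, double exits_per_sec)
{
	assert(extra_ns >= 0.0 && exits_per_sec >= 0.0);

	/* ns -> seconds, times the number of exits in one second */
	return extra_ns * 1e-9 * exits_per_sec;
}
```

With 300ns and 20000 exits/second this gives roughly 0.006, i.e. 0.6%;
with the 200ns taken from the table it gives roughly 0.004, i.e. 0.4%.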

Paolo

> To be clear, I would find it very reasonable to be asked to prove
> the benefit of the HLT optimization with benchmarks specific to that
> single one-liner, but until then, the idea that we can drop the
> retpoline optimization from the HLT vmexit by just thinking about it,
> still doesn't make sense to me, because by thinking about it I come to
> the opposite conclusion.
> 
> The lack of single_task_running() in the guest driver is also why the
> guest cpuidle haltpoll risks wasting some CPU with host overcommit or
> with the host loaded at full capacity and why we may not assume it to
> be universally enabled.
> 
> Thanks,
> Andrea
>