Re: [PATCH 12/14] KVM: retpolines: x86: eliminate retpoline from vmx.c exit handlers

Andrea Arcangeli <aarcange@xxxxxxxxxx> · Wed, 16 Oct 2019 12:50:57 -0400

On Wed, Oct 16, 2019 at 09:07:39AM +0200, Paolo Bonzini wrote:
> Yet you would add CPUID to the list even though it is not even there in
> your benchmarks, and is *never* invoked in a hot path by *any* sane

I justified CPUID as a "short term" benchmark gadget, it's one of
those it shouldn't be a problem at all to remove, I couldn't possibly
be against removing it. I only pointed out the fact cpuid on any
modern linux guest is going to run more frequently than any inb/outb
so if I had to pick a benchmark gadget, that remains my favorite one.

> program? Some OSes have never gotten virtio 1.0 drivers.  OpenBSD only
> got it earlier this year.

If the target is an optimization to a cranky OS that can't upgrade
virtio to obtain the full performance benefit from the retpoline
removal too (I don't know the specifics by just reading the above)
then it's a better argument. At least it sounds fair enough not to
unfair penalize the cranky OS forced to run obsolete protocols that
nobody can update or has the time to update.

I mean, until you said there's some OS that cannot upgrade to virtio
1.0, I thought it was perfectly fine to say "if you want to run a
guest with the full benefit of virtio 1.0 on KVM, you should upgrade
to virtio 1.0 and not stick to whatever 3 year old protocol, then also
the inb/outb retpoline will go away if you upgrade the host because
the inb/outb will go away in the first place".

> It still doesn't add up.  0.3ms / 5 is 1/15000th of a second; 43us is
> 1/25000th of a second.  Do you have multiple vCPU perhaps?

Why would I run any test on UP guests? Rather then spending time doing
the math on my results, it's probably quicker that you run it yourself:

https://lkml.kernel.org/r/20190109034941.28759-1-aarcange@xxxxxxxxxx/

Marcelo should have better reproducers for frequent HLT that is a real
workload we have to pass, I reported the first two random things I had
around that reported fairly frequent HLT. The pipe loop load is
similar to local network I/O.

> The number of vmexits doesn't count (for HLT).  What counts is how long
> they take to be serviced, and as long as it's 1us or more the
> optimization is pointless.
> 
> Consider these pictures
> 
>          w/o optimization                   with optimization
>          ----------------------             -------------------------
> 0us      vmexit                             vmexit
> 500ns    retpoline                          call vmexit handler directly
> 600ns    retpoline                          kvm_vcpu_check_block()
> 700ns    retpoline                          kvm_vcpu_check_block()
> 800ns    kvm_vcpu_check_block()             kvm_vcpu_check_block()
> 900ns    kvm_vcpu_check_block()             kvm_vcpu_check_block()
> ...
> 39900ns  kvm_vcpu_check_block()             kvm_vcpu_check_block()
> 
>                             <interrupt arrives>
> 
> 40000ns  kvm_vcpu_check_block()             kvm_vcpu_check_block()
> 
> 
> Unless the interrupt arrives exactly in the few nanoseconds that it
> takes to execute the retpoline, a direct handling of HLT vmexits makes
> *absolutely no difference*.
> 

You keep focusing on what happens if the host is completely idle (in
which case guest HLT is a slow path) and you keep ignoring the case
that the host isn't completely idle (in which case guest HLT is not a
slow path).

Please note the single_task_running() check which immediately breaks
the kvm_vcpu_check_block() loop if there's even a single other task
that can be scheduled in the runqueue of the host CPU.

What happen when the host is not idle is quoted below:

         w/o optimization                   with optimization
         ----------------------             -------------------------
0us      vmexit                             vmexit
500ns    retpoline                          call vmexit handler directly
600ns    retpoline                          kvm_vcpu_check_block()
700ns    retpoline                          schedule()
800ns    kvm_vcpu_check_block()
900ns    schedule()
...

Disclaimer: the numbers on the left are arbitrary and I just cut and
pasted them from yours, no idea how far off they are.

To be clear, I would find it very reasonable to be requested to proof
the benefit of the HLT optimization with benchmarks specifics for that
single one liner, but until then, the idea that we can drop the
retpoline optimization from the HLT vmexit by just thinking about it,
still doesn't make sense to me, because by thinking about it I come to
the opposite conclusion.

The lack of single_task_running() in the guest driver is also why the
guest cpuidle haltpoll risks to waste some CPU with host overcommit or
with the host loaded at full capacity and why we may not assume it to
be universally enabled.

Thanks,
Andrea