I'm looking at KVM HV P9 path guest exit/entry performance with the Cify patches, plus some further work to see what more we can do. The measurement is of the guest making a "NULL hcall" and the return back to the guest (non-nested).

Two cases are considered: first, returning to the guest from the "try_real_mode" hcall handler; second, returning to the guest after going around the loop in kvmppc_vcpu_run_hv (i.e., exiting to full host kernel context, but not to host usermode). The real-mode test is a proxy for real mode hcall and other interrupt handlers, and the full exit is a proxy for virtual mode hcalls and interrupt handlers.

The test was done with powernv_defconfig, radix guest and radix host, on a POWER9 with meltdown mitigations disabled. A minor hack was made just to get the immediate-return / NULL hcall behaviour so performance could be measured (a rough sketch of the kind of timing loop involved is appended at the end of this mail).

* Upstream try_real_mode return - 509 cycles
* Upstream virt NULL hcall - 9587 cycles
* KVM Cify virt NULL hcall - 9333 cycles
* KVM Cify+opt virt NULL hcall - 5754 cycles (1.67x faster than upstream, or 60% of the cycles required)

The KVM Cify series (which you have already seen) plus the further optimisations patch series is here:

https://github.com/npiggin/linux/tree/kvm-in-c-new

Some of the major further optimisation patches have their individual cycle improvement contributions annotated. In many cases things are inter-dependent, e.g., patch A might improve 100 cycles and patch B 50 cycles, but A+B together might improve 250 because they combine to avoid an SPR stall. So take the individual numbers with a grain of salt; the cumulative result above is what matters most.

In summary, the Cify series does not hurt entry/exit performance, which is good. It actually helps a bit, though I'm not sure exactly where. And we can make quite a lot more improvement with this series.

HOWEVER! The Cify series removes the very fast real mode hcall and interrupt handlers (except for some things like machine check). So any real mode handler will be handled as a virt mode handler on P9 after Cify. I have some further patches in progress that should shave about 1000 more cycles from the full exit, but beyond that it gets pretty tough to improve, and that still leaves it an order of magnitude slower than the real mode return.

Now, I did say this doesn't matter so much for a P9/radix/xive guest, which is true, except possibly for the TCE hcalls that Alexey brought to my attention (any other important cases?). So we will have to think about that. Alexey did say that the real mode TCE hcalls were added for P8 and are less important for P9, but it is something to keep an eye on. We might end up adding a faster handler back, but I would much prefer it not be run entirely in guest context as they are today (maybe switch the MMU context, TB, and a few other important SPRs, and enable translation, so it can run practically as host kernel context). But I think we should wait and see, and add the complexity only if it comes up as a problem.

The other thing is that the P9 path now implements the P9 hash guest support after the Cify series. Hash does a lot more exits due to translation hcalls and interrupts. I did some basic measurements (e.g., kernel compile) and couldn't see a significant slowdown. In any case I think the P9 hash code is not important to micro-optimise; it was only done to simplify code and remove asm, so I would rather not add complexity for that.

Thanks,
Nick
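
P.S. For illustration only, here is a rough sketch of what such a guest-side NULL hcall timing loop could look like. This is not the exact test code: the hcall number H_NULL_TEST is a made-up placeholder, the host-side hack that makes it return immediately is not shown, and mftb() counts timebase ticks rather than core cycles, so the per-hcall figure would still need scaling by the core-to-timebase frequency ratio.

/*
 * Illustrative sketch of a guest-side NULL hcall timing loop.
 * Built as a pseries guest kernel module; loading it runs the test
 * once, prints the result, and then fails the load on purpose.
 */
#include <linux/module.h>
#include <linux/kernel.h>
#include <asm/hvcall.h>
#include <asm/time.h>

#define NULL_HCALL_ITERS	100000
/* Made-up hcall number; the host is assumed hacked to return immediately. */
#define H_NULL_TEST		0x5FF0

static int __init null_hcall_test_init(void)
{
	u64 start, end;
	int i;

	start = mftb();
	for (i = 0; i < NULL_HCALL_ITERS; i++)
		plpar_hcall_norets(H_NULL_TEST);
	end = mftb();

	/* Timebase ticks, not cycles: scale by core/timebase frequency. */
	pr_info("NULL hcall round trip: %llu timebase ticks\n",
		(end - start) / NULL_HCALL_ITERS);

	return -ENODEV;	/* don't actually stay loaded */
}
module_init(null_hcall_test_init);

MODULE_LICENSE("GPL");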