I'm looking at KVM HV P9 path guest exit/entry performance with the Cify patches, plus some further work to see what more we can do. The measurement is of the guest making a "NULL hcall" and the return back to the guest (non-nested).

Two cases are considered: first, returning to the guest from the "try_real_mode" hcall handler; second, returning to the guest after going around the loop in kvmppc_vcpu_run_hv (i.e., exiting to full host kernel context, but not to host usermode). The real-mode test is a proxy for real mode hcall and other interrupt handlers, and the full exit is a proxy for virtual mode hcalls and interrupt handlers.

The test was done with powernv_defconfig, radix guest and radix host, on a POWER9 with meltdown mitigations disabled. A minor hack was made just to get the immediate-return / NULL hcall behaviour so performance could be measured (a rough sketch of the kind of timing loop involved is appended at the end of this mail).

* Upstream try_real_mode return - 509 cycles
* Upstream virt NULL hcall - 9587 cycles
* KVM Cify virt NULL hcall - 9333 cycles
* KVM Cify+opt virt NULL hcall - 5754 cycles (1.67x faster than upstream, or 60% of the cycles required)

The KVM Cify series (which you have already seen) plus the further optimisations patch series is here:

https://github.com/npiggin/linux/tree/kvm-in-c-new

Some of the major further optimisation patches have their individual cycle improvement contributions annotated. In many cases things are inter-dependent, e.g., patch A might improve 100 cycles and patch B 50 cycles, but A+B together might improve 250 because they combine to avoid an SPR stall. So take the individual numbers with a grain of salt; the cumulative result above is what matters most.

In summary, the Cify series does not hurt entry/exit performance, which is good. It actually helps a bit, though I'm not sure exactly where. And we can make quite a lot more improvement with this series.

HOWEVER! The Cify series removes the very fast real mode hcall and interrupt handlers (except for some things like machine check). So any real mode handler will be handled as a virt mode handler on P9 after Cify. I have some further patches in progress that should shave about 1000 more cycles from the full exit, but beyond that it gets pretty tough to improve, and that still leaves it an order of magnitude slower than the real mode return.

Now, I did say this doesn't matter so much for a P9/radix/xive guest, which is true, except possibly for the TCE hcalls that Alexey brought to my attention (any other important cases?). So we will have to think about that. Alexey did say that the real mode TCE hcalls were added for P8 and are less important for P9, but it is something to keep an eye on. We might end up adding a faster handler back, but I would much prefer it not be run entirely in guest context as they are today (maybe switch the MMU context, TB, and a few other important SPRs, and enable translation, so it can run practically as host kernel context). But I think we should wait and see, and add the complexity only if it comes up as a problem.

The other thing is that the P9 path now implements the P9 hash guest support after the Cify series. Hash does a lot more exits due to translation hcalls and interrupts. I did some basic measurements (e.g., kernel compile) and couldn't see a significant slowdown. In any case I think the P9 hash code is not important to micro-optimise; it was only done to simplify code and remove asm, so I would rather not add complexity for that.

Thanks,
Nick
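
P.S. For illustration only, here is a rough sketch of what such a guest-side NULL hcall timing loop could look like. This is not the exact test code: the hcall number H_NULL_TEST is a made-up placeholder, the host-side hack that makes it return immediately is not shown, and mftb() counts timebase ticks rather than core cycles, so the per-hcall figure would still need scaling by the core-to-timebase frequency ratio.

/*
 * Illustrative sketch of a guest-side NULL hcall timing loop.
 * Built as a pseries guest kernel module; loading it runs the test
 * once, prints the result, and then fails the load on purpose.
 */
#include <linux/module.h>
#include <linux/kernel.h>
#include <asm/hvcall.h>
#include <asm/time.h>

#define NULL_HCALL_ITERS	100000
/* Made-up hcall number; the host is assumed hacked to return immediately. */
#define H_NULL_TEST		0x5FF0

static int __init null_hcall_test_init(void)
{
	u64 start, end;
	int i;

	start = mftb();
	for (i = 0; i < NULL_HCALL_ITERS; i++)
		plpar_hcall_norets(H_NULL_TEST);
	end = mftb();

	/* Timebase ticks, not cycles: scale by core/timebase frequency. */
	pr_info("NULL hcall round trip: %llu timebase ticks\n",
		(end - start) / NULL_HCALL_ITERS);

	return -ENODEV;	/* don't actually stay loaded */
}
module_init(null_hcall_test_init);

MODULE_LICENSE("GPL");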