That's about 800-1000 clock cycles more that can be easily peeled, by
saving about 60 VMWRITEs on every exit.

My numbers so far have been collected on a Haswell system, vs. the
Broadwell that Jim used for his KVM Forum talk, and I am now down from
22000 (compared to the 18000 that Jim gave as the baseline) to 14000.
Also, the guest is running 4.14, so it doesn't have the XSETBV and
DEBUGCTL patches; those would remove two ancillary exits to L1, each
costing about 1000 cycles on my machine.  So we are probably pretty
close to VMware's 6500 cycles on Broadwell.

After these patches there may still be some low-hanging fruit; the
remaining large deltas between non-nested and nested workloads with
lots of vmexits are:

    4.80%  vmx_set_cr3
    4.35%  native_read_msr
    3.73%  vmcs_load
    3.65%  update_permission_bitmask
    2.49%  _raw_spin_lock
    2.37%  sync_vmcs12
    2.20%  copy_shadow_to_vmcs12
    1.19%  kvm_load_guest_fpu

There is a large cost associated with resetting the MMU.  Making that
smarter could probably be worth a 10-15% improvement; not easy, but
actually even more worthwhile than that on SMP nested guests, because
that is where the spinlock contention comes from.

The MSR accesses are probably also interesting, but I haven't tried to
see what they are about.  One somewhat crazy idea in that area is to
set CR4.FSGSBASE at vcpu_load/sched_in and clear it at
vcpu_put/sched_out; then we could skip the costly setup of the
FS/GS/kernelGS base MSRs.  However, the cost of writes to CR4 might
make it less appealing for userspace exits; I haven't benchmarked it.

Paolo

Paolo Bonzini (4):
  KVM: VMX: split list of shadowed VMCS field to a separate file
  KVM: nVMX: track dirty state of non-shadowed VMCS fields
  KVM: nVMX: move descriptor cache handling to prepare_vmcs02_full
  KVM: nVMX: move other simple fields to prepare_vmcs02_full

 arch/x86/kvm/vmx.c               | 301 +++++++++++++++++++--------------------
 arch/x86/kvm/vmx_shadow_fields.h |  71 +++++++++
 2 files changed, 214 insertions(+), 158 deletions(-)
 create mode 100644 arch/x86/kvm/vmx_shadow_fields.h

-- 
1.8.3.1
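
P.S. As an appendix, here is a minimal sketch of the dirty-tracking
scheme behind the "track dirty state" patch, assuming a single dirty
bit is enough granularity for all non-shadowed fields; handle_vmwrite
and prepare_vmcs02 are the existing functions, prepare_vmcs02_full is
introduced by the series, and the field name dirty_vmcs12 is just
illustrative here:

	/* One dirty bit covering every non-shadowed vmcs12 field. */
	struct nested_vmx {
		/* ... */
		bool dirty_vmcs12;
	};

	/* Any path that lets L1 modify vmcs12 must set the bit. */
	static int handle_vmwrite(struct kvm_vcpu *vcpu)
	{
		/* ... decode the field and store it into vmcs12 ... */
		to_vmx(vcpu)->nested.dirty_vmcs12 = true;
		return kvm_skip_emulated_instruction(vcpu);
	}

	/* On vmentry, the ~60 rarely-changing VMWRITEs happen only
	 * when vmcs12 actually changed since the last entry.
	 */
	static void prepare_vmcs02(struct kvm_vcpu *vcpu,
				   struct vmcs12 *vmcs12)
	{
		struct vcpu_vmx *vmx = to_vmx(vcpu);

		if (vmx->nested.dirty_vmcs12) {
			prepare_vmcs02_full(vcpu, vmcs12);
			vmx->nested.dirty_vmcs12 = false;
		}
		/* ... fields that change on every exit follow ... */
	}

Shadowed fields don't need the bit, because hardware writes them
straight into the shadow VMCS and copy_shadow_to_vmcs12 brings them
back.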
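
P.P.S. To make the CR4.FSGSBASE idea concrete, it could look roughly
like the untested sketch below.  vmx_vcpu_load/vmx_vcpu_put are the
existing KVM callbacks; wrfsbase() is open-coded (F3 REX.W 0F AE /2)
since the host tree has no helper for it, the GS variants would be
analogous, and host_fs_base stands for whatever value the host needs
restored:

	#include <asm/tlbflush.h>	/* cr4_set_bits/cr4_clear_bits */

	/* WRFSBASE %rax -- valid only while CR4.FSGSBASE=1. */
	static inline void wrfsbase(unsigned long base)
	{
		asm volatile(".byte 0xf3, 0x48, 0x0f, 0xae, 0xd0"
			     : : "a" (base));
	}

	static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
	{
		cr4_set_bits(X86_CR4_FSGSBASE);
		/* ... existing load logic ... */
	}

	static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
	{
		/* ... existing put logic ... */
		cr4_clear_bits(X86_CR4_FSGSBASE);
	}

	/* Then, on the vmexit path, the MSR write
	 *
	 *	wrmsrl(MSR_FS_BASE, host_fs_base);
	 *
	 * becomes the much cheaper
	 *
	 *	wrfsbase(host_fs_base);
	 */

The open question from above still stands: cr4_set_bits/cr4_clear_bits
themselves write CR4, so whether this is a net win depends on how often
we go through vcpu_load/vcpu_put relative to the exits that reload the
base MSRs.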