On Tue, 2023-10-10 at 16:25 -0700, Sean Christopherson wrote: > On Tue, Oct 10, 2023, David Woodhouse wrote: > > If I understand things correctly, the point of the TDP MMU is to use > > page tables such as EPT for GPA → HPA translations, but let the > > virtualization support in the CPU handle all of the *virtual* > > addressing and page tables, including the non-root mode %cr3/%cr4. > > > > I have a guest which loves to flip the SMEP bit on and off in %cr4 all > > the time. The guest is actually Xen, in its 'PV shim' mode which > > enables it to support a single PV guest, while running in a true > > hardware virtual machine: > > https://lists.xenproject.org/archives/html/xen-devel/2018-01/msg00497.html > > > > The performance is *awful*, since as far as I can tell, on every flip > > KVM flushes the entire EPT. I understand why that might be necessary > > for the mode where KVM is building up a set of shadow page tables to > > directly map GVA → HPA and be loaded into %cr3 of a CPU that doesn't > > support native EPT translations. But I don't understand why the TDP MMU > > would need to do it. Surely we don't have to change anything in the EPT > > just because the stuff in the non-root-mode %cr3/%cr4 changes? > > > > So I tried this, and it went faster and nothing appears to have blown > > up. > > > > Am I missing something? Is this stupidly wrong? > > Heh, you're in luck, because regardless of what your darn pronoun "this" refers > to, the answer is yes, "this" is stupidly wrong. Hehe. Thought that might be the case. Thank you for the coherent explanation and especially for the references. (I hadn't got the PV shim working in qemu yet. I shall bump that up my TODO list by a few quanta, as it would have let me run this test on a newer kernel.) > The below is stupidly wrong. KVM needs to at least reconfigure the guest's paging > metadata that is used to translate GVAs to GPAs during emulation. > > But the TDP MMU behavior *was* also stupidly wrong. The reason that two vCPUs > suck less is because KVM would zap SPTEs (EPT roots) if and only if *both* vCPUs > unloaded their roots at the same time. > > Commit edbdb43fc96b ("KVM: x86: Preserve TDP MMU roots until they are explicitly > invalidated") should fix the behavior you're seeing. > > And if we want to try and make SMEP blazing fast on Intel, we can probably let > the guest write it directly, i.e. give SMEP the same treatment as CR0.WP. See > commits cf9f4c0eb169 ("KVM: x86/mmu: Refresh CR0.WP prior to checking for emulated > permission faults") and fb509f76acc8 ("KVM: VMX: Make CR0.WP a guest owned bit"). Thanks. In fact, looking at __kvm_mmu_refresh_passthrough_bits() in commit cf9f4c0eb169 ("KVM: x86/mmu: Refresh CR0.WP…"), shouldn't it be looking at the SMEP bit in %cr4 anyway? In kvm_calc_cpu_role() the value of ____is_cr0_wp() is used *three* times. For setting role.base.{cr0_wp,smep_andnot_wp,smap_andnot_wp}. But __kvm_mmu_refresh_passthrough_bits() only refreshes role.base.cr0_wp and not the other two. Do we need this? --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -5159,6 +5159,8 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu, return; mmu->cpu_role.base.cr0_wp = cr0_wp; + mmu->cpu_role.base.smep_andnot_wp = mmu->cpu_role.ext.cr4_smep && !cr0_wp; + mmu->cpu_role.base.smap_andnot_wp = mmu->cpu_role.ext.cr4_smap && !cr0_wp; reset_guest_paging_metadata(vcpu, mmu); } But I'm confused here. Even if I don't go as far as actually making CR4.SMEP a guest-owned bit, and KVM still ends up handling it in kvm_post_load_cr4()... why does KVM need to completely unload and reinit the MMU? Would it not be sufficient just to refresh the role bits, much like __kvm_mmu_refresh_passthrough_bits() does for CR0.WP? (And what about flushing the hardware TLB, as Jim mentioned. I guess if it's guest-owned we trust the CPU to do that, and if it's trapped then KVM is required to do so)? > Oh, and if your userspace is doing something silly like constantly creating and > deleting memslots, see commit 0df9dab891ff ("KVM: x86/mmu: Stop zapping invalidated > TDP MMU roots asynchronously"). Oooh, I'll play with that. Thanks! Although I bristle at 'silly' :) Userspace only wants to add and remove single overlay pages. That's not so silly, that's explicitly requested by the guest and required for PV drivers to work. AIUI it's the KVM API which means that userspace needs to stop all vCPUs, *remove* the original memslot, add back the two halves of the original memslot with the newly overlaid page in the middle, and then let all the vCPUs run again. We don't need an atomic way to change *all* the memslots at once, but a way to atomically overlay a *single* range would make the whole enterprise feel less 'silly'.
Attachment:
smime.p7s
Description: S/MIME cryptographic signature