On Mon, Oct 30, 2023, Yan Zhao wrote: > On Fri, Oct 27, 2023 at 04:13:36PM -0700, Sean Christopherson wrote: > > E.g. I have a very hard time believing a real world guest kernel mucks with the > > MTRRs to setup DMA. And again, this is supported by the absense of bug reports > > on AMD. > > > > > > Yan, > > > > You've been digging into this code recently, am I forgetting something because > > it's late on a Friday? Or have we been making the very bad assumption that KVM > > code from 10+ years ago actually makes sense? I.e. for non-coherent DMA, can we > > delete all of the MTRR insanity and simply clear IPAT? > Not sure if there are guest drivers can program PAT as WB but treat memory type > as UC. > In theory, honoring guest MTRRs is the most safe way. > Do you think a complete analyse of all corner cases are deserved? I 100% agree that honoring guest MTRRs is the safest, but KVM's current approach make no sense, at all. From a hardware virtualization perspective of guest MTRRs, there is _nothing_ special about EPT. Legacy shadowing paging doesn't magically account for guest MTRRs, nor does NPT. The only wrinkle is that NPT honors gCR0, i.e. actually puts the CPU caches into no-fill mode, whereas VMX does nothing and forces KVM to (poorly) emulate that behavior by forcing UC. TL;DR of the below: Rather than try to make MTRR virtualization suck less for EPT, I think we should delete that code entirely and take a KVM errata to formally document that KVM doesn't virtualize guest MTRRs. In addition to solving the performance issues with zapping SPTEs for MTRR changes, that'll eliminate 600+ lines of complex code (the overlay shenanigans used for fixed MTRRs are downright mean). arch/x86/include/asm/kvm_host.h | 15 +--- arch/x86/kvm/mmu/mmu.c | 16 ---- arch/x86/kvm/mtrr.c | 644 ++++++------------------------------------------------------------------------------------------------------------------------------- arch/x86/kvm/vmx/vmx.c | 12 +-- arch/x86/kvm/x86.c | 1 - arch/x86/kvm/x86.h | 4 - 6 files changed, 36 insertions(+), 656 deletions(-) Digging deeper through the history, this *mostly* appears to be the result of coming to the complete wrong conclusion for handling memtypes during EPT and VT-d enabling. The zapping GFNs logic came from commit efdfe536d8c643391e19d5726b072f82964bfbdb Author: Xiao Guangrong <guangrong.xiao@xxxxxxxxxxxxxxx> Date: Wed May 13 14:42:27 2015 +0800 KVM: MMU: fix MTRR update Currently, whenever guest MTRR registers are changed kvm_mmu_reset_context is called to switch to the new root shadow page table, however, it's useless since: 1) the cache type is not cached into shadow page's attribute so that the original root shadow page will be reused 2) the cache type is set on the last spte, that means we should sync the last sptes when MTRR is changed This patch fixs this issue by drop all the spte in the gfn range which is being updated by MTRR which was a fix for commit 0bed3b568b68e5835ef5da888a372b9beabf7544 Author: Sheng Yang <sheng@xxxxxxxxxxxxxxx> AuthorDate: Thu Oct 9 16:01:54 2008 +0800 Commit: Avi Kivity <avi@xxxxxxxxxx> CommitDate: Wed Dec 31 16:51:44 2008 +0200 KVM: Improve MTRR structure As well as reset mmu context when set MTRR. (side topic, if anyone wonders why I am so particular about changelogs, the above is exactly Anyways, the above was part of a "MTRR/PAT support for EPT" series that also added + if (mt_mask) { + mt_mask = get_memory_type(vcpu, gfn) << + kvm_x86_ops->get_mt_mask_shift(); + spte |= mt_mask; + } where get_memory_type() was a truly gnarly helper to retrive the guest MTRR memtype for a given memtype. And *very* subtly, at the time of that change, KVM *always* set VMX_EPT_IGMT_BIT, kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK | VMX_EPT_DEFAULT_MT << VMX_EPT_MT_EPTE_SHIFT | VMX_EPT_IGMT_BIT); which came in via commit 928d4bf747e9c290b690ff515d8f81e8ee226d97 Author: Sheng Yang <sheng@xxxxxxxxxxxxxxx> AuthorDate: Thu Nov 6 14:55:45 2008 +0800 Commit: Avi Kivity <avi@xxxxxxxxxx> CommitDate: Tue Nov 11 21:00:37 2008 +0200 KVM: VMX: Set IGMT bit in EPT entry There is a potential issue that, when guest using pagetable without vmexit when EPT enabled, guest would use PAT/PCD/PWT bits to index PAT msr for it's memory, which would be inconsistent with host side and would cause host MCE due to inconsistent cache attribute. The patch set IGMT bit in EPT entry to ignore guest PAT and use WB as default memory type to protect host (notice that all memory mapped by KVM should be WB). Note the CommitDates! The AuthorDates strongly suggests Sheng Yang added the whole IGMT things as a bug fix for issues that were detected during EPT + VT-d + passthrough enabling, but Avi applied it earlier because it was a generic fix. Jumping back to 0bed3b568b68 ("KVM: Improve MTRR structure"), the other relevant code, or rather lack thereof, is the handling of *host* MMIO. That fix came in a bit later, but given the author and timing, I think it's safe to say it was all part of the same EPT+VT-d enabling mess. commit 2aaf69dcee864f4fb6402638dd2f263324ac839f Author: Sheng Yang <sheng@xxxxxxxxxxxxxxx> AuthorDate: Wed Jan 21 16:52:16 2009 +0800 Commit: Avi Kivity <avi@xxxxxxxxxx> CommitDate: Sun Feb 15 02:47:37 2009 +0200 KVM: MMU: Map device MMIO as UC in EPT Software are not allow to access device MMIO using cacheable memory type, the patch limit MMIO region with UC and WC(guest can select WC using PAT and PCD/PWT). In addition to the host MMIO and IGMT issues, this code was obviously never tested on NPT until much later, which lends further credence to my theory/argument that this was all the result of misdiagnosed issues. Discussion from the EPT+MTRR enabling thread[*] more or less confirms that Sheng Yang was trying to resolve issues with passthrough MMIO. * Sheng Yang : Do you mean host(qemu) would access this memory and if we set it to guest : MTRR, host access would be broken? We would cover this in our shadow MTRR : patch, for we encountered this in video ram when doing some experiment with : VGA assignment. And in the same thread, there's also what appears to be confirmation of Intel running into issues with Windows XP related to a guest device driver mapping DMA with WC in the PAT. Hilariously, Avi effectively said "KVM can't modify the SPTE memtype to match the guest for EPT/NPT", which while true, completely overlooks the fact that EPT and NPT both honor guest PAT by default. /facepalm * Avi Kavity : Sheng Yang wrote: : > Yes... But it's easy to do with assigned devices' mmio, but what if guest : > specific some non-mmio memory's memory type? E.g. we have met one issue in : > Xen, that a assigned-device's XP driver specific one memory region as buffer, : > and modify the memory type then do DMA. : > : > Only map MMIO space can be first step, but I guess we can modify assigned : > memory region memory type follow guest's? : > : : With ept/npt, we can't, since the memory type is in the guest's : pagetable entries, and these are not accessible. [*] https://lore.kernel.org/all/1223539317-32379-1-git-send-email-sheng@xxxxxxxxxxxxxxx So, for the most part, what I think happened is that 15 years ago, a few engineers (a) fixed a #MC problem by ignoring guest PAT and (b) initially "fixed" passthrough device MMIO by emulating *guest* MTRRs. Except for the below case, everything since then has been a result of those two intertwined changes. The one exception, which is actually yet more confirmation of all of the above, is the revert of Paolo's attempt at "full" virtualization of guest MTRRs: commit 606decd67049217684e3cb5a54104d51ddd4ef35 Author: Paolo Bonzini <pbonzini@xxxxxxxxxx> Date: Thu Oct 1 13:12:47 2015 +0200 Revert "KVM: x86: apply guest MTRR virtualization on host reserved pages" This reverts commit fd717f11015f673487ffc826e59b2bad69d20fe5. It was reported to cause Machine Check Exceptions (bug 104091). ... commit fd717f11015f673487ffc826e59b2bad69d20fe5 Author: Paolo Bonzini <pbonzini@xxxxxxxxxx> Date: Tue Jul 7 14:38:13 2015 +0200 KVM: x86: apply guest MTRR virtualization on host reserved pages Currently guest MTRR is avoided if kvm_is_reserved_pfn returns true. However, the guest could prefer a different page type than UC for such pages. A good example is that pass-throughed VGA frame buffer is not always UC as host expected. This patch enables full use of virtual guest MTRRs. I.e. Paolo tried to add back KVM's behavior before "Map device MMIO as UC in EPT" and got the same result: machine checks, likely due to the guest MTRRs not being trustworthy/sane at all times. And FWIW, Paolo also tried to enable MTRR virtualization on NP, but that too got reverted. I read through the threads, and AFAICT no one ever found a smoking gun, i.e. exactly why emulating guest MTRRs via NPT PAT caused extremely slow boot times doesn't appear to have a definitive root cause. commit fc07e76ac7ffa3afd621a1c3858a503386a14281 Author: Paolo Bonzini <pbonzini@xxxxxxxxxx> Date: Thu Oct 1 13:20:22 2015 +0200 Revert "KVM: SVM: use NPT page attributes" This reverts commit 3c2e7f7de3240216042b61073803b61b9b3cfb22. Initializing the mapping from MTRR to PAT values was reported to fail nondeterministically, and it also caused extremely slow boot (due to caching getting disabled---bug 103321) with assigned devices. ... commit 3c2e7f7de3240216042b61073803b61b9b3cfb22 Author: Paolo Bonzini <pbonzini@xxxxxxxxxx> Date: Tue Jul 7 14:32:17 2015 +0200 KVM: SVM: use NPT page attributes Right now, NPT page attributes are not used, and the final page attribute depends solely on gPAT (which however is not synced correctly), the guest MTRRs and the guest page attributes. However, we can do better by mimicking what is done for VMX. In the absence of PCI passthrough, the guest PAT can be ignored and the page attributes can be just WB. If passthrough is being used, instead, keep respecting the guest PAT, and emulate the guest MTRRs through the PAT field of the nested page tables. The only snag is that WP memory cannot be emulated correctly, because Linux's default PAT setting only includes the other types. In other words, my reading of the tea leaves is that honoring guest MTRRs for VMX was initially a workaround of sorts for KVM ignoring guest PAT *and* for KVM not forcing UC for host MMIO. And while there *are* known cases where honoring guest MTRRs is desirable, e.g. passthrough VGA frame buffers, the desired behavior in that case is to get WC instead of UC, i.e. at this point it's for performance, not correctness. Furthermore, the complete absense of MTRR virtualization on NPT and shadow paging proves that while KVM theoretically can do better, it's by no means necessary for correctnesss. Lastly, I would argue that since kernels mostly rely on firmware to do MTRR setup, and the host typically provides guest firmware, honoring guest MTRRs is effectively honoring *host* userspace memtypes, which is also backwards, i.e. it would be far better for host userspace to communicate its desired directly to KVM (or perhaps indirectly via VMAs in the host kernel, just not through guest MTRRs).