Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms

Yan Zhao <yan.y.zhao@xxxxxxxxx> · Tue, 31 Oct 2023 18:01:01 +0800

On Mon, Oct 30, 2023 at 12:24:02PM -0700, Sean Christopherson wrote:
> On Mon, Oct 30, 2023, Yan Zhao wrote:
> > On Fri, Oct 27, 2023 at 04:13:36PM -0700, Sean Christopherson wrote:
> > > E.g. I have a very hard time believing a real world guest kernel mucks with the
> > > MTRRs to setup DMA.  And again, this is supported by the absense of bug reports
> > > on AMD.
> > > 
> > > 
> > > Yan,
> > > 
> > > You've been digging into this code recently, am I forgetting something because
> > > it's late on a Friday?  Or have we been making the very bad assumption that KVM
> > > code from 10+ years ago actually makes sense?  I.e. for non-coherent DMA, can we
> > > delete all of the MTRR insanity and simply clear IPAT?
> > Not sure if there are guest drivers can program PAT as WB but treat memory type
> > as UC.
> > In theory, honoring guest MTRRs is the most safe way.
> > Do you think a complete analyse of all corner cases are deserved?
> 
> I 100% agree that honoring guest MTRRs is the safest, but KVM's current approach
> make no sense, at all.  From a hardware virtualization perspective of guest MTRRs,
> there is _nothing_ special about EPT.  Legacy shadowing paging doesn't magically
> account for guest MTRRs, nor does NPT.
> 
> The only wrinkle is that NPT honors gCR0, i.e. actually puts the CPU caches into
> no-fill mode, whereas VMX does nothing and forces KVM to (poorly) emulate that
> behavior by forcing UC.
> 
> TL;DR of the below: Rather than try to make MTRR virtualization suck less for EPT,
> I think we should delete that code entirely and take a KVM errata to formally
> document that KVM doesn't virtualize guest MTRRs.  In addition to solving the
> performance issues with zapping SPTEs for MTRR changes, that'll eliminate 600+
> lines of complex code (the overlay shenanigans used for fixed MTRRs are downright
> mean).
> 
>  arch/x86/include/asm/kvm_host.h |  15 +---
>  arch/x86/kvm/mmu/mmu.c          |  16 ----
>  arch/x86/kvm/mtrr.c             | 644 ++++++-------------------------------------------------------------------------------------------------------------------------------
>  arch/x86/kvm/vmx/vmx.c          |  12 +--
>  arch/x86/kvm/x86.c              |   1 -
>  arch/x86/kvm/x86.h              |   4 -
>  6 files changed, 36 insertions(+), 656 deletions(-)
> 
> Digging deeper through the history, this *mostly* appears to be the result of coming
> to the complete wrong conclusion for handling memtypes during EPT and VT-d enabling.
> 
> The zapping GFNs logic came from
> 
>   commit efdfe536d8c643391e19d5726b072f82964bfbdb
>   Author: Xiao Guangrong <guangrong.xiao@xxxxxxxxxxxxxxx>
>   Date:   Wed May 13 14:42:27 2015 +0800
> 
>     KVM: MMU: fix MTRR update
>     
>     Currently, whenever guest MTRR registers are changed
>     kvm_mmu_reset_context is called to switch to the new root shadow page
>     table, however, it's useless since:
>     1) the cache type is not cached into shadow page's attribute so that
>        the original root shadow page will be reused
>     
>     2) the cache type is set on the last spte, that means we should sync
>        the last sptes when MTRR is changed
>     
>     This patch fixs this issue by drop all the spte in the gfn range which
>     is being updated by MTRR
> 
> which was a fix for 
> 
>   commit 0bed3b568b68e5835ef5da888a372b9beabf7544
>   Author:     Sheng Yang <sheng@xxxxxxxxxxxxxxx>
>   AuthorDate: Thu Oct 9 16:01:54 2008 +0800
>   Commit:     Avi Kivity <avi@xxxxxxxxxx>
>   CommitDate: Wed Dec 31 16:51:44 2008 +0200
>   
>       KVM: Improve MTRR structure
>       
>       As well as reset mmu context when set MTRR.
> 
> (side topic, if anyone wonders why I am so particular about changelogs, the above
> is exactly 
> 
> Anyways, the above was part of a "MTRR/PAT support for EPT" series that also added
> 
> +	if (mt_mask) {
> +		mt_mask = get_memory_type(vcpu, gfn) <<
> +			  kvm_x86_ops->get_mt_mask_shift();
> +		spte |= mt_mask;
> +	}
> 
> where get_memory_type() was a truly gnarly helper to retrive the guest MTRR memtype
> for a given memtype.  And *very* subtly, at the time of that change, KVM *always*
> set VMX_EPT_IGMT_BIT,
> 
>         kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK |
>                 VMX_EPT_WRITABLE_MASK |
>                 VMX_EPT_DEFAULT_MT << VMX_EPT_MT_EPTE_SHIFT |
>                 VMX_EPT_IGMT_BIT);
> 
> which came in via
> 
>   commit 928d4bf747e9c290b690ff515d8f81e8ee226d97
>   Author:     Sheng Yang <sheng@xxxxxxxxxxxxxxx>
>   AuthorDate: Thu Nov 6 14:55:45 2008 +0800
>   Commit:     Avi Kivity <avi@xxxxxxxxxx>
>   CommitDate: Tue Nov 11 21:00:37 2008 +0200
>   
>       KVM: VMX: Set IGMT bit in EPT entry
>       
>       There is a potential issue that, when guest using pagetable without vmexit when
>       EPT enabled, guest would use PAT/PCD/PWT bits to index PAT msr for it's memory,
>       which would be inconsistent with host side and would cause host MCE due to
>       inconsistent cache attribute.
>       
>       The patch set IGMT bit in EPT entry to ignore guest PAT and use WB as default
>       memory type to protect host (notice that all memory mapped by KVM should be WB).
> 
> Note the CommitDates!  The AuthorDates strongly suggests Sheng Yang added the whole
> IGMT things as a bug fix for issues that were detected during EPT + VT-d + passthrough
> enabling, but Avi applied it earlier because it was a generic fix.
>
My feeling is that
Current memtype handling for non-coherent DMA is a compromise between
(a) security ("qemu mappings will use writeback and guest mapping will use guest
specified memory types")
(b) the effective memtype cannot be cacheable if guest thinks it's non-cacheable.

So, for MMIOs in non-coherent DMAs, mapping them as UC in EPT is understandable,
because other value like WB or WC is not preferred --
guest usually sets MMIOs' PAT to UC or WC, so "PAT=UC && EPT=WB" or
"PAT=UC && EPT=WC" are not preferred according to SDM due to page aliasing.
And VFIO maps the MMIOs to UC in host.
(With pass-through GPU in my env, the MMIOs' guest MTRR is UC,
 I can observe host hang if I program its EPT type to
 - WB+IPAT or
 - WC
 )

For guest RAM, looks honoring guest MTRRs just mitigates the page aliasing
problem.
E.g. if guest PAT=UC because its MTRR=UC, setting EPT type=UC can avoid
"guest PAT=UC && EPT=WB", which is not recommended in SDM.
But it still breaks (a) if guest PAT is UC.
Also, honoring guest MTRRs in EPT is friendly to old systems that do not enable
PAT. I guess :)
But I agree, in common cases, honoring guest MTRRs or not looks no big difference.
(And I'm not lucky enough to reproduce page-aliasing-caused MCE yet in my
environment).

For CR0_CD=1,
- w/o KVM_X86_QUIRK_CD_NW_CLEARED, it meets (b), but breaks (a).
- w/  KVM_X86_QUIRK_CD_NW_CLEARED, with IPAT=1, it meets (a), but breaks (b);
                                   with IPAT=0, it may breaks (a), but meets (b)

> Jumping back to 0bed3b568b68 ("KVM: Improve MTRR structure"), the other relevant
> code, or rather lack thereof, is the handling of *host* MMIO.  That fix came in a
> bit later, but given the author and timing, I think it's safe to say it was all
> part of the same EPT+VT-d enabling mess.
> 
>   commit 2aaf69dcee864f4fb6402638dd2f263324ac839f
>   Author:     Sheng Yang <sheng@xxxxxxxxxxxxxxx>
>   AuthorDate: Wed Jan 21 16:52:16 2009 +0800
>   Commit:     Avi Kivity <avi@xxxxxxxxxx>
>   CommitDate: Sun Feb 15 02:47:37 2009 +0200
> 
>     KVM: MMU: Map device MMIO as UC in EPT
>     
>     Software are not allow to access device MMIO using cacheable memory type, the
>     patch limit MMIO region with UC and WC(guest can select WC using PAT and
>     PCD/PWT).
> 
> In addition to the host MMIO and IGMT issues, this code was obviously never tested
> on NPT until much later, which lends further credence to my theory/argument that
> this was all the result of misdiagnosed issues.
> 
> Discussion from the EPT+MTRR enabling thread[*] more or less confirms that Sheng
> Yang was trying to resolve issues with passthrough MMIO.
> 
>  * Sheng Yang 
>   : Do you mean host(qemu) would access this memory and if we set it to guest 
>   : MTRR, host access would be broken? We would cover this in our shadow MTRR 
>   : patch, for we encountered this in video ram when doing some experiment with 
>   : VGA assignment. 
> 
> And in the same thread, there's also what appears to be confirmation of Intel
> running into issues with Windows XP related to a guest device driver mapping
> DMA with WC in the PAT.  Hilariously, Avi effectively said "KVM can't modify the
> SPTE memtype to match the guest for EPT/NPT", which while true, completely overlooks
> the fact that EPT and NPT both honor guest PAT by default.  /facepalm

My interpretation is that the since guest PATs are in guest page tables,
while with EPT/NPT, guest page tables are not shadowed, it's not easy to
check guest PATs  to disallow host QEMU access to non-WB guest RAM.

The credence is with Avi's following word:
"Looks like a conflict between the requirements of a hypervisor 
supporting device assignment, and the memory type constraints of mapping 
everything with the same memory type.  As far as I can see, the only 
solution is not to map guest memory in the hypervisor, and do all 
accesses via dma.  This is easy for virtual disk, somewhat harder for 
virtual networking (need a dma engine or a multiqueue device).

Since qemu will only access memory on demand, we don't actually have to 
unmap guest memory, only to ensure that qemu doesn't touch it.  Things 
like live migration and page sharing won't work, but they aren't 
expected to with device assignment anyway."

>  
>  * Avi Kavity
>   : Sheng Yang wrote:
>   : > Yes... But it's easy to do with assigned devices' mmio, but what if guest 
>   : > specific some non-mmio memory's memory type? E.g. we have met one issue in 
>   : > Xen, that a assigned-device's XP driver specific one memory region as buffer, 
>   : > and modify the memory type then do DMA.
>   : >
>   : > Only map MMIO space can be first step, but I guess we can modify assigned 
>   : > memory region memory type follow guest's? 
>   : >   
>   : 
>   : With ept/npt, we can't, since the memory type is in the guest's 
>   : pagetable entries, and these are not accessible
> 
> [*] https://lore.kernel.org/all/1223539317-32379-1-git-send-email-sheng@xxxxxxxxxxxxxxx
> 
> So, for the most part, what I think happened is that 15 years ago, a few engineers
> (a) fixed a #MC problem by ignoring guest PAT and (b) initially "fixed" passthrough
> device MMIO by emulating *guest* MTRRs.  Except for the below case, everything since
> then has been a result of those two intertwined changes.
> 
> The one exception, which is actually yet more confirmation of all of the above,
> is the revert of Paolo's attempt at "full" virtualization of guest MTRRs:
> 
>   commit 606decd67049217684e3cb5a54104d51ddd4ef35
>   Author: Paolo Bonzini <pbonzini@xxxxxxxxxx>
>   Date:   Thu Oct 1 13:12:47 2015 +0200
> 
>     Revert "KVM: x86: apply guest MTRR virtualization on host reserved pages"
>     
>     This reverts commit fd717f11015f673487ffc826e59b2bad69d20fe5.
>     It was reported to cause Machine Check Exceptions (bug 104091).
> 
> ...
> 
>   commit fd717f11015f673487ffc826e59b2bad69d20fe5
>   Author: Paolo Bonzini <pbonzini@xxxxxxxxxx>
>   Date:   Tue Jul 7 14:38:13 2015 +0200
> 
>     KVM: x86: apply guest MTRR virtualization on host reserved pages
>     
>     Currently guest MTRR is avoided if kvm_is_reserved_pfn returns true.
>     However, the guest could prefer a different page type than UC for
>     such pages. A good example is that pass-throughed VGA frame buffer is
>     not always UC as host expected.
>     
>     This patch enables full use of virtual guest MTRRs.
> 
> I.e. Paolo tried to add back KVM's behavior before "Map device MMIO as UC in EPT"
> and got the same result: machine checks, likely due to the guest MTRRs not being
> trustworthy/sane at all times.
> 
> And FWIW, Paolo also tried to enable MTRR virtualization on NP, but that too got
> reverted.  I read through the threads, and AFAICT no one ever found a smoking gun,
> i.e. exactly why emulating guest MTRRs via NPT PAT caused extremely slow boot times
> doesn't appear to have a definitive root cause.
> 
>   commit fc07e76ac7ffa3afd621a1c3858a503386a14281
>   Author: Paolo Bonzini <pbonzini@xxxxxxxxxx>
>   Date:   Thu Oct 1 13:20:22 2015 +0200
> 
>     Revert "KVM: SVM: use NPT page attributes"
>     
>     This reverts commit 3c2e7f7de3240216042b61073803b61b9b3cfb22.
>     Initializing the mapping from MTRR to PAT values was reported to
>     fail nondeterministically, and it also caused extremely slow boot
>     (due to caching getting disabled---bug 103321) with assigned devices.
>
> ...
> 
>   commit 3c2e7f7de3240216042b61073803b61b9b3cfb22
>   Author: Paolo Bonzini <pbonzini@xxxxxxxxxx>
>   Date:   Tue Jul 7 14:32:17 2015 +0200
> 
>     KVM: SVM: use NPT page attributes
>     
>     Right now, NPT page attributes are not used, and the final page
>     attribute depends solely on gPAT (which however is not synced
>     correctly), the guest MTRRs and the guest page attributes.
>     
>     However, we can do better by mimicking what is done for VMX.
>     In the absence of PCI passthrough, the guest PAT can be ignored
>     and the page attributes can be just WB.  If passthrough is being
>     used, instead, keep respecting the guest PAT, and emulate the guest
>     MTRRs through the PAT field of the nested page tables.
>     
>     The only snag is that WP memory cannot be emulated correctly,
>     because Linux's default PAT setting only includes the other types.
> 
> In other words, my reading of the tea leaves is that honoring guest MTRRs for VMX
> was initially a workaround of sorts for KVM ignoring guest PAT *and* for KVM not
> forcing UC for host MMIO.  And while there *are* known cases where honoring guest
> MTRRs is desirable, e.g. passthrough VGA frame buffers, the desired behavior in
> that case is to get WC instead of UC, i.e. at this point it's for performance,
> not correctness.
> 
> Furthermore, the complete absense of MTRR virtualization on NPT and shadow paging
> proves that while KVM theoretically can do better, it's by no means necessary for
> correctnesss.
> 
> Lastly, I would argue that since kernels mostly rely on firmware to do MTRR setup,
> and the host typically provides guest firmware, honoring guest MTRRs is effectively
> honoring *host* userspace memtypes, which is also backwards, i.e. it would be far
> better for host userspace to communicate its desired directly to KVM (or perhaps
> indirectly via VMAs in the host kernel, just not through guest MTRRs).