Re: A question about how the KVM emulates the effect of guest MTRRs on AMD platforms

On Tue, Oct 31, 2023 at 08:14:41AM -0700, Sean Christopherson wrote:
> On Tue, Oct 31, 2023, Yan Zhao wrote:
> > On Mon, Oct 30, 2023 at 12:24:02PM -0700, Sean Christopherson wrote:
> > > On Mon, Oct 30, 2023, Yan Zhao wrote:
> > > Digging deeper through the history, this *mostly* appears to be the result of coming
> > > to the completely wrong conclusion for handling memtypes during EPT and VT-d enabling.
> 
> ...
> 
> > > Note the CommitDates!  The AuthorDates strongly suggest Sheng Yang added the whole
> > > IGMT thing as a bug fix for issues that were detected during EPT + VT-d + passthrough
> > > enabling, but Avi applied it earlier because it was a generic fix.
> > >
> > My feeling is that the current memtype handling for non-coherent DMA is a
> > compromise between
> > (a) security ("qemu mappings will use writeback and guest mapping will use guest
> > specified memory types")
> > (b) the effective memtype cannot be cacheable if guest thinks it's non-cacheable.
> 
> And correctness.  E.g. accessing memory with conflicting memtypes could cause guest
> data corruption, which isn't strictly the same as (a).
> 
> > So, for MMIOs in non-coherent DMAs, mapping them as UC in EPT is understandable,
> > because other values like WB or WC are not preferred --
> > guest usually sets MMIOs' PAT to UC or WC, so "PAT=UC && EPT=WB" or
> > "PAT=UC && EPT=WC" are not preferred according to SDM due to page aliasing.
> > And VFIO maps the MMIOs to UC in host.
> > (With pass-through GPU in my env, the MMIOs' guest MTRR is UC,
> >  I can observe host hang if I program its EPT type to
> >  - WB+IPAT or
> >  - WC
> >  )
> 
> Yes, but all of that simply confirms that it's KVM's responsibility to map host
> MMIO as UC.  The hangs you observe likely have nothing to do with memory aliasing,
> and everything to do with accessing real MMIO with incompatible memtypes.
Yes, you are right.
For EPT type = WC, the hang actually happens because pci_iomap() maps with
PAT=UC- by default, so the effective memory type becomes WC, which is wrong.
If I force the driver to map with PAT=UC, the driver works normally even
with EPT type = WC.
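
To make the combination rule concrete, here is a minimal sketch of how I
understand the effective memory type to be derived from the EPT memtype, the
IPAT bit, and the guest PAT type. It covers only the EPT types UC/WC/WB and
the PAT types UC/UC-/WC/WB, and it is written from my reading of the SDM's
MTRR+PAT combining table (with the EPT memtype standing in for the MTRR
type). The enum and function names are made up for illustration and are not
kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>

/* Memory types, using the x86 MTRR/EPT encodings (UC=0, WC=1, WB=6). */
enum memtype { MT_UC = 0, MT_WC = 1, MT_WB = 6 };

/* Guest PAT types covered by this sketch (hypothetical enum). */
enum pat_type { PAT_UC, PAT_UC_MINUS, PAT_WC, PAT_WB };

/*
 * Effective memtype of a guest access for EPT memtype 'ept', EPT "ignore
 * PAT" bit 'ipat', and guest PAT type 'pat'.  Assumption: without IPAT the
 * EPT memtype is combined with the PAT type the same way an MTRR type is.
 */
enum memtype ept_effective_memtype(enum memtype ept, bool ipat,
				   enum pat_type pat)
{
	assert(ept == MT_UC || ept == MT_WC || ept == MT_WB);

	if (ipat)
		return ept;		/* guest PAT is ignored entirely */

	switch (pat) {
	case PAT_UC:
		return MT_UC;		/* strong UC always wins */
	case PAT_WC:
		return MT_WC;		/* PAT WC overrides the EPT type */
	case PAT_UC_MINUS:
		/* UC- yields to WC, otherwise it forces UC */
		return ept == MT_WC ? MT_WC : MT_UC;
	case PAT_WB:
	default:
		return ept;		/* PAT WB defers to the EPT type */
	}
}
```

With this model, EPT=WC combined with the PAT=UC- mapping from pci_iomap()
gives WC, while forcing PAT=UC gives UC, which matches what I observed.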

> 
> > For guest RAM, it looks like honoring guest MTRRs just mitigates the page
> > aliasing problem.
> > E.g. if guest PAT=UC because its MTRR=UC, setting EPT type=UC can avoid
> > "guest PAT=UC && EPT=WB", which is not recommended in SDM.
> > But it still breaks (a) if guest PAT is UC.
> > Also, honoring guest MTRRs in EPT is friendly to old systems that do not enable
> > PAT. I guess :)
> 
> LOL, no way.  The PAT can't be disabled, and the default PAT combinations are
> backwards compatible with legacy PCD+PWT.  The only way for this to provide value
> is if someone is virtualizing a pre-Pentium Pro CPU, doing device passthrough,
> and *only* doing so on hardware with EPT.
> 
> > But I agree, in common cases, honoring guest MTRRs or not makes no big difference.
> > (And I'm not lucky enough to reproduce page-aliasing-caused MCE yet in my
> > environment).
> 
> FWIW, I don't think that page aliasing with WC/UC actually causes machine checks.
> What does result in #MC (assuming things haven't changed in the last few years)
> is accessing MMIO using WB and other cacheable memtypes, e.g. map the host APIC
> with WB and you should see #MCs.  I suspect this is what people encountered years
> ago when KVM attempted to honor guest MTRRs at all times.  E.g. the "full" MTRR
> virtualization patch that got reverted deliberately allowed the guest to control
> the memtype for host MMIO.
> 
> The SDM makes aliasing sound super scary, but then has footnotes where it explicitly
> requires the CPU to play nice with aliasing, e.g. if MTRRs are *not* UC but the
> effective memtype is UC, then the CPU is *required* to snoop caches:
>
Yes, I tried the combinations below, and none of them triggers #MC.
- effective memory type for guest access is WC, and that for host access is UC
- effective memory type for guest access is UC, and that for host access is WC
- effective memory type for guest access is UC, and that for host access is WB


>   2. The UC attribute came from the page-table or page-directory entry and
>      processors are required to check their caches because the data may be cached
>      due to page aliasing, which is not recommended.
> 
> Lack of snooping can effectively cause data corruption and ordering issues, but
> at least for WC/UC vs. WB I don't think there are actual #MC problems with aliasing.
> 
Is there no #MC even on guest RAM?
E.g. what if the guest effective memory type is UC/WC, and the host effective
memory type is WB?
(I tried guest PAT=WC + host PAT=WB on my machines and saw no #MC, but I'm not
sure whether I'm missing something or whether this holds only in my specific
environment.)

If there is no #MC, could the EPT type of guest RAM also be set to WB (without
IPAT) even without non-coherent DMA?
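
For reference, the EPT memtype policy being discussed in this thread can be
modeled roughly as below. This is a simplified sketch based on my reading of
vmx_get_mt_mask() at the time of writing, not a copy of the kernel code. The
struct, field, and function names are hypothetical, and setting IPAT in the
CR0.CD=1 case is an assumption about current behavior.

```c
#include <stdbool.h>
#include <stdint.h>

#define EPT_MT_SHIFT	3		/* EPT memtype lives in bits 5:3 */
#define EPT_IPAT_BIT	(1u << 6)	/* "ignore PAT" bit of an EPT entry */

enum { MT_UC = 0, MT_WB = 6 };		/* x86 memtype encodings */

/* Hypothetical bundle of the inputs the policy depends on. */
struct mt_inputs {
	bool is_mmio;			/* gfn is backed by host MMIO */
	bool has_noncoherent_dma;	/* VM has non-coherent DMA devices */
	bool guest_cr0_cd;		/* guest CR0.CD is set */
	bool quirk_cd_nw_cleared;	/* KVM_X86_QUIRK_CD_NW_CLEARED enabled */
	uint8_t guest_mtrr_type;	/* memtype from guest MTRRs for the gfn */
};

/* Return the memtype and IPAT bits for an EPT entry (modeled policy). */
uint32_t ept_memtype_bits(const struct mt_inputs *in)
{
	/* Host MMIO is always UC; guest PAT may still select WC. */
	if (in->is_mmio)
		return (uint32_t)MT_UC << EPT_MT_SHIFT;

	/* Without non-coherent DMA, force WB and ignore guest PAT. */
	if (!in->has_noncoherent_dma)
		return ((uint32_t)MT_WB << EPT_MT_SHIFT) | EPT_IPAT_BIT;

	/* CR0.CD=1: WB with the quirk, UC without (IPAT set: assumption). */
	if (in->guest_cr0_cd) {
		uint32_t mt = in->quirk_cd_nw_cleared ? MT_WB : MT_UC;

		return (mt << EPT_MT_SHIFT) | EPT_IPAT_BIT;
	}

	/* Otherwise honor guest MTRRs and leave guest PAT in effect. */
	return (uint32_t)in->guest_mtrr_type << EPT_MT_SHIFT;
}
```

In this model, my question above amounts to dropping EPT_IPAT_BIT from the
second branch, so that guest PAT would take effect for guest RAM even when
there is no non-coherent DMA.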

> > For CR0_CD=1,
> > - w/o KVM_X86_QUIRK_CD_NW_CLEARED, it meets (b), but breaks (a).
> > - w/  KVM_X86_QUIRK_CD_NW_CLEARED, with IPAT=1, it meets (a), but breaks (b);
> > >                                    with IPAT=0, it may break (a), but meets (b)
> 
> CR0.CD=1 is a mess above and beyond memtypes.  Huh.  It's even worse than I thought,
> because according to the SDM, Atom CPUs don't support no-fill mode:
> 
>   3. Not supported In Intel Atom processors. If CD = 1 in an Intel Atom processor,
>      caching is disabled.
> 
> Before I read that blurb about Atom CPUs, what I was going to say is that, AFAIK,
> it's *impossible* to accurately virtualize CR0.CD=1 on VMX because there's no way
> to emulate no-fill mode.
> 
> > > Discussion from the EPT+MTRR enabling thread[*] more or less confirms that Sheng
> > > Yang was trying to resolve issues with passthrough MMIO.
> > > 
> > >  * Sheng Yang 
> > >   : Do you mean host(qemu) would access this memory and if we set it to guest 
> > >   : MTRR, host access would be broken? We would cover this in our shadow MTRR 
> > >   : patch, for we encountered this in video ram when doing some experiment with 
> > >   : VGA assignment. 
> > > 
> > > And in the same thread, there's also what appears to be confirmation of Intel
> > > running into issues with Windows XP related to a guest device driver mapping
> > > DMA with WC in the PAT.  Hilariously, Avi effectively said "KVM can't modify the
> > > SPTE memtype to match the guest for EPT/NPT", which while true, completely overlooks
> > > the fact that EPT and NPT both honor guest PAT by default.  /facepalm
> > 
> > My interpretation is that since guest PATs are in guest page tables,
> > while with EPT/NPT, guest page tables are not shadowed, it's not easy to
> > check guest PATs  to disallow host QEMU access to non-WB guest RAM.
> 
> Ah, yeah, your interpretation makes sense.
> 
> The best idea I can think of to support things like this is to have KVM grab the
> effective PAT memtype from the host userspace page tables, shove that into the
> EPT/NPT memtype, and then ignore guest PAT.  I don't know if that would actually work
> though.
Hmm, it might not work. E.g. on a GPU, some MMIOs are mapped as UC- while
others are mapped as WC, even though they belong to the same BAR.
I don't think the host can know in advance which one to choose.
I think the same holds for RAM ranges: the guest can memremap to a memory
type that the host doesn't know beforehand.

> 
> > The basis for this is Avi's following words:
> > "Looks like a conflict between the requirements of a hypervisor 
> > supporting device assignment, and the memory type constraints of mapping 
> > everything with the same memory type.  As far as I can see, the only 
> > solution is not to map guest memory in the hypervisor, and do all 
> > accesses via dma.  This is easy for virtual disk, somewhat harder for 
> > virtual networking (need a dma engine or a multiqueue device).
> > 
> > Since qemu will only access memory on demand, we don't actually have to 
> > unmap guest memory, only to ensure that qemu doesn't touch it.  Things 
> > like live migration and page sharing won't work, but they aren't 
> > expected to with device assignment anyway."



