On Tue, Oct 31, 2023 at 08:14:41AM -0700, Sean Christopherson wrote:
> On Tue, Oct 31, 2023, Yan Zhao wrote:
> > On Mon, Oct 30, 2023 at 12:24:02PM -0700, Sean Christopherson wrote:
> > > On Mon, Oct 30, 2023, Yan Zhao wrote:
> > > Digging deeper through the history, this *mostly* appears to be the result of coming
> > > to the completely wrong conclusion for handling memtypes during EPT and VT-d enabling.
> > ...
>
> > > Note the CommitDates!  The AuthorDates strongly suggest Sheng Yang added the whole
> > > IGMT things as a bug fix for issues that were detected during EPT + VT-d + passthrough
> > > enabling, but Avi applied it earlier because it was a generic fix.
> > >
> > My feeling is that
> > Current memtype handling for non-coherent DMA is a compromise between
> > (a) security ("qemu mappings will use writeback and guest mapping will use guest
> >     specified memory types")
> > (b) the effective memtype cannot be cacheable if guest thinks it's non-cacheable.
>
> And correctness.  E.g. accessing memory with conflicting memtypes could cause guest
> data corruption, which isn't strictly the same as (a).
>
> > So, for MMIOs in non-coherent DMAs, mapping them as UC in EPT is understandable,
> > because other value like WB or WC is not preferred --
> > guest usually sets MMIOs' PAT to UC or WC, so "PAT=UC && EPT=WB" or
> > "PAT=UC && EPT=WC" are not preferred according to SDM due to page aliasing.
> > And VFIO maps the MMIOs to UC in host.
> > (With pass-through GPU in my env, the MMIOs' guest MTRR is UC,
> >  I can observe host hang if I program its EPT type to
> >  - WB+IPAT or
> >  - WC
> > )
>
> Yes, but all of that simply confirms that it's KVM's responsibility to map host
> MMIO as UC.  The hangs you observe likely have nothing to do with memory aliasing,
> and everything to do with accessing real MMIO with incompatible memtypes.
Yes, you are right.
For EPT type = WC, the hang is actually because pci_iomap() maps PAT as UC- by default,
so the effective memory type is WC, which is wrong.
If I force the driver to map with PAT=UC, then the driver works normally even with
EPT type = WC.

>
> > For guest RAM, looks honoring guest MTRRs just mitigates the page aliasing
> > problem.
> > E.g. if guest PAT=UC because its MTRR=UC, setting EPT type=UC can avoid
> > "guest PAT=UC && EPT=WB", which is not recommended in SDM.
> > But it still breaks (a) if guest PAT is UC.
> > Also, honoring guest MTRRs in EPT is friendly to old systems that do not enable
> > PAT. I guess :)
>
> LOL, no way.  The PAT can't be disabled, and the default PAT combinations are
> backwards compatible with legacy PCD+PWT.  The only way for this to provide value
> is if someone is virtualizing a pre-Pentium Pro CPU, doing device passthrough,
> and *only* doing so on hardware with EPT.
>
> > But I agree, in common cases, honoring guest MTRRs or not looks no big difference.
> > (And I'm not lucky enough to reproduce page-aliasing-caused MCE yet in my
> > environment).
>
> FWIW, I don't think that page aliasing with WC/UC actually causes machine checks.
> What does result in #MC (assuming things haven't changed in the last few years)
> is accessing MMIO using WB and other cacheable memtypes, e.g. map the host APIC
> with WB and you should see #MCs.  I suspect this is what people encountered years
> ago when KVM attempted to honor guest MTRRs at all times.  E.g. the "full" MTRR
> virtualization patch that got reverted deliberately allowed the guest to control
> the memtype for host MMIO.
>
> The SDM makes aliasing sound super scary, but then has footnotes where it explicitly
> requires the CPU to play nice with aliasing, e.g. if MTRRs are *not* UC but the
> effective memtype is UC, then the CPU is *required* to snoop caches:
>
Yes, I tried the combinations below, and none of them can trigger #MC.
- effective memory type for guest access is WC, and that for host access is UC
- effective memory type for guest access is UC, and that for host access is WC
- effective memory type for guest access is UC, and that for host access is WB

>   2. The UC attribute came from the page-table or page-directory entry and
>      processors are required to check their caches because the data may be cached
>      due to page aliasing, which is not recommended.
>
> Lack of snooping can effectively cause data corruption and ordering issues, but
> at least for WC/UC vs. WB I don't think there are actual #MC problems with aliasing.
>
Even no #MC on guest RAM?
E.g. what if guest effective memory type is UC/WC, and host effective memory type is WB?
(I tried on my machines with guest PAT=WC + host PAT=WB and saw no #MC, but I'm not sure
whether I'm missing something or it only holds in my specific environment.)
If no #MC, could the EPT type of guest RAM also be set to WB (without IPAT) even without
non-coherent DMA?

> > For CR0_CD=1,
> > - w/o KVM_X86_QUIRK_CD_NW_CLEARED, it meets (b), but breaks (a).
> > - w/  KVM_X86_QUIRK_CD_NW_CLEARED, with IPAT=1, it meets (a), but breaks (b);
> >                                    with IPAT=0, it may break (a), but meets (b)
>
> CR0.CD=1 is a mess above and beyond memtypes.  Huh.  It's even worse than I thought,
> because according to the SDM, Atom CPUs don't support no-fill mode:
>
>   3. Not supported In Intel Atom processors. If CD = 1 in an Intel Atom processor,
>      caching is disabled.
>
> Before I read that blurb about Atom CPUs, what I was going to say is that, AFAIK,
> it's *impossible* to accurately virtualize CR0.CD=1 on VMX because there's no way
> to emulate no-fill mode.
>
> > > Discussion from the EPT+MTRR enabling thread[*] more or less confirms that Sheng
> > > Yang was trying to resolve issues with passthrough MMIO.
> > >
> > >  * Sheng Yang
> > >   : Do you mean host(qemu) would access this memory and if we set it to guest
> > >   : MTRR, host access would be broken?
> > >   : We would cover this in our shadow MTRR
> > >   : patch, for we encountered this in video ram when doing some experiment with
> > >   : VGA assignment.
> > >
> > > And in the same thread, there's also what appears to be confirmation of Intel
> > > running into issues with Windows XP related to a guest device driver mapping
> > > DMA with WC in the PAT.  Hilariously, Avi effectively said "KVM can't modify the
> > > SPTE memtype to match the guest for EPT/NPT", which while true, completely overlooks
> > > the fact that EPT and NPT both honor guest PAT by default.  /facepalm
> >
> > My interpretation is that since guest PATs are in guest page tables,
> > while with EPT/NPT, guest page tables are not shadowed, it's not easy to
> > check guest PATs to disallow host QEMU access to non-WB guest RAM.
>
> Ah, yeah, your interpretation makes sense.
>
> The best idea I can think of to support things like this is to have KVM grab the
> effective PAT memtype from the host userspace page tables, shove that into the
> EPT/NPT memtype, and then ignore guest PAT.  I don't know if that would actually work
> though.
Hmm, it might not work. E.g. in a GPU, some MMIOs are mapped as UC-, while some
others are mapped as WC, even though they belong to the same BAR.
I don't think the host can know which one to choose in advance.
I think the same should hold for RAM ranges: the guest can do memremap to a memory
type that the host doesn't know beforehand.

>
> > The credence is with Avi's following word:
> > "Looks like a conflict between the requirements of a hypervisor
> > supporting device assignment, and the memory type constraints of mapping
> > everything with the same memory type.  As far as I can see, the only
> > solution is not to map guest memory in the hypervisor, and do all
> > accesses via dma.  This is easy for virtual disk, somewhat harder for
> > virtual networking (need a dma engine or a multiqueue device).
> >
> > Since qemu will only access memory on demand, we don't actually have to
> > unmap guest memory, only to ensure that qemu doesn't touch it.  Things
> > like live migration and page sharing won't work, but they aren't
> > expected to with device assignment anyway."
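
For reference, here is a minimal sketch of the memtype combination behind the
pci_iomap() observation above (EPT=WC with guest PAT=UC- ends up as WC, while
PAT=UC ends up as UC). It assumes the SDM rule that, with IPAT=0 and CR0.CD=0,
the EPT memtype stands in for the MTRR memtype when combined with the PAT
memtype. Only the UC/WC/WB rows discussed in this thread are listed, and the
table and function names are purely illustrative, not KVM code; please check
the entries against the SDM before relying on them.

```python
# Illustrative only: a subset of the SDM "effective page-level memory type"
# table, with the EPT memtype used in place of the MTRR memtype (IPAT=0).
# The names below are hypothetical and do not exist in KVM.
EFFECTIVE_MEMTYPE = {
    # (ept_memtype, guest_pat): effective memtype
    ("UC", "UC"): "UC", ("UC", "UC-"): "UC", ("UC", "WC"): "WC", ("UC", "WB"): "UC",
    ("WC", "UC"): "UC", ("WC", "UC-"): "WC", ("WC", "WC"): "WC", ("WC", "WB"): "WC",
    ("WB", "UC"): "UC", ("WB", "UC-"): "UC", ("WB", "WC"): "WC", ("WB", "WB"): "WB",
}


def effective_memtype(ept_memtype, guest_pat, ipat=False):
    """Return the effective memtype for a guest access.

    With IPAT=1 the guest PAT is ignored and the EPT memtype is used as-is.
    With IPAT=0 the EPT and PAT memtypes are combined via the table above.
    """
    if ipat:
        return ept_memtype
    return EFFECTIVE_MEMTYPE[(ept_memtype, guest_pat)]


# The passthrough GPU cases from this thread:
for ept, pat, ipat in [("WC", "UC-", False), ("WC", "UC", False), ("WB", "UC", True)]:
    print(f"EPT={ept} IPAT={int(ipat)} PAT={pat} -> {effective_memtype(ept, pat, ipat)}")
```

Under these assumptions, the UC- mapping is weakened to WC by the EPT type
while a strong UC mapping is not, and WB+IPAT makes the MMIO access cacheable
regardless of what the guest asked for.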