On Tue, Mar 04, 2025 at 01:06:45AM -0500, Paolo Bonzini wrote: > From: Yan Zhao <yan.y.zhao@xxxxxxxxx> > > Introduce an Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT to have > KVM ignore guest PAT when this quirk is enabled. > > On AMD platforms, KVM always honors guest PAT. On Intel however there are > two issues. First, KVM *cannot* honor guest PAT if CPU feature self-snoop > is not supported. Second, UC access on certain Intel platforms can be very > slow[1] and honoring guest PAT on those platforms may break some old > guests that accidentally specify video RAM as UC. Those old guests may > never expect the slowness since KVM always forces WB previously. See [2]. > > So, introduce a quirk that KVM can enable by default on all Intel platforms > to avoid breaking old unmodifiable guests. Newer userspace can disable this > quirk if it wishes KVM to honor guest PAT; disabling the quirk will fail > if self-snoop is not supported, i.e. if KVM cannot obey the wish. > > The quirk is a no-op on AMD and also if any assigned devices have > non-coherent DMA. This is not an issue, as KVM_X86_QUIRK_CD_NW_CLEARED is > another example of a quirk that is sometimes automatically disabled. > > Suggested-by: Paolo Bonzini <pbonzini@xxxxxxxxxx> > Suggested-by: Sean Christopherson <seanjc@xxxxxxxxxx> > Cc: Kevin Tian <kevin.tian@xxxxxxxxx> > Signed-off-by: Yan Zhao <yan.y.zhao@xxxxxxxxx> > Link: https://lore.kernel.org/all/Ztl9NWCOupNfVaCA@xxxxxxxxxxxxxxxxxxxxxxxxx # [1] > Link: https://lore.kernel.org/all/87jzfutmfc.fsf@xxxxxxxxxx # [2] > Message-ID: <20250224070946.31482-1-yan.y.zhao@xxxxxxxxx> > [Use supported_quirks/inapplicable_quirks to support both AMD and > no-self-snoop cases, as well as to remove the shadow_memtype_mask check > from kvm_mmu_may_ignore_guest_pat(). - Paolo] > Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx> > --- > Documentation/virt/kvm/api.rst | 22 ++++++++++++++++++ > arch/x86/include/asm/kvm_host.h | 6 +++-- > arch/x86/include/uapi/asm/kvm.h | 1 + > arch/x86/kvm/mmu.h | 2 +- > arch/x86/kvm/mmu/mmu.c | 10 ++++---- > arch/x86/kvm/vmx/vmx.c | 41 +++++++++++++++++++++++++++------ > arch/x86/kvm/x86.c | 2 +- > 7 files changed, 69 insertions(+), 15 deletions(-) > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > index 2d75edc9db4f..452439b605af 100644 > --- a/Documentation/virt/kvm/api.rst > +++ b/Documentation/virt/kvm/api.rst > @@ -8157,6 +8157,28 @@ KVM_X86_QUIRK_STUFF_FEATURE_MSRS By default, at vCPU creation, KVM sets the > and 0x489), as KVM does now allow them to > be set by userspace (KVM sets them based on > guest CPUID, for safety purposes). > + > +KVM_X86_QUIRK_IGNORE_GUEST_PAT By default, on Intel platforms, KVM ignores > + guest PAT and forces the effective memory > + type to WB in EPT. The quirk is not available > + on Intel platforms which are incapable of > + safely honoring guest PAT (i.e., without CPU > + self-snoop, KVM always ignores guest PAT and > + forces effective memory type to WB). It is Not sure if it's necessary to add something like: The quirk is also not available on Intel platforms which do not enable EPT (i.e., in the shadow paging case, KVM always ignores guest PAT). > + also ignored on AMD platforms or, on Intel, > + when a VM has non-coherent DMA devices > + assigned; KVM always honors guest PAT in > + such case. The quirk is needed to avoid > + slowdowns on certain Intel Xeon platforms > + (e.g. ICX, SPR) where self-snoop feature is > + supported but UC is slow enough to cause > + issues with some older guests that use > + UC instead of WC to map the video RAM. > + Userspace can disable the quirk to honor > + guest PAT if it knows that there is no such > + guest software, for example if it does not > + expose a bochs graphics device (which is > + known to have had a buggy driver). > =================================== ============================================ > > 7.32 KVM_CAP_MAX_VCPU_ID > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index a4f213d235dd..9b9dde476f3c 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -2418,10 +2418,12 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages); > KVM_X86_QUIRK_FIX_HYPERCALL_INSN | \ > KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS | \ > KVM_X86_QUIRK_SLOT_ZAP_ALL | \ > - KVM_X86_QUIRK_STUFF_FEATURE_MSRS) > + KVM_X86_QUIRK_STUFF_FEATURE_MSRS | \ > + KVM_X86_QUIRK_IGNORE_GUEST_PAT) > > #define KVM_X86_CONDITIONAL_QUIRKS \ > - KVM_X86_QUIRK_CD_NW_CLEARED > + (KVM_X86_QUIRK_CD_NW_CLEARED | \ > + KVM_X86_QUIRK_IGNORE_GUEST_PAT) > > /* > * KVM previously used a u32 field in kvm_run to indicate the hypercall was > diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h > index 89cc7a18ef45..dc4d6428dd02 100644 > --- a/arch/x86/include/uapi/asm/kvm.h > +++ b/arch/x86/include/uapi/asm/kvm.h > @@ -441,6 +441,7 @@ struct kvm_sync_regs { > #define KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS (1 << 6) > #define KVM_X86_QUIRK_SLOT_ZAP_ALL (1 << 7) > #define KVM_X86_QUIRK_STUFF_FEATURE_MSRS (1 << 8) > +#define KVM_X86_QUIRK_IGNORE_GUEST_PAT (1 << 9) > > #define KVM_STATE_NESTED_FORMAT_VMX 0 > #define KVM_STATE_NESTED_FORMAT_SVM 1 > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h > index 47e64a3c4ce3..f999c15d8d3e 100644 > --- a/arch/x86/kvm/mmu.h > +++ b/arch/x86/kvm/mmu.h > @@ -232,7 +232,7 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, > return -(u32)fault & errcode; > } > > -bool kvm_mmu_may_ignore_guest_pat(void); > +bool kvm_mmu_may_ignore_guest_pat(struct kvm *kvm); > > int kvm_mmu_post_init_vm(struct kvm *kvm); > void kvm_mmu_pre_destroy_vm(struct kvm *kvm); > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index e6eb3a262f8d..9d6294f76d19 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -4663,17 +4663,19 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, > } > #endif > > -bool kvm_mmu_may_ignore_guest_pat(void) > +bool kvm_mmu_may_ignore_guest_pat(struct kvm *kvm) > { > /* > * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM > * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to > * honor the memtype from the guest's PAT so that guest accesses to > * memory that is DMA'd aren't cached against the guest's wishes. As a > - * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA, > - * KVM _always_ ignores guest PAT (when EPT is enabled). > + * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA. > + * KVM _always_ ignores guest PAT, when EPT is enabled and when quirk > + * KVM_X86_QUIRK_IGNORE_GUEST_PAT is enabled or the CPU lacks the > + * ability to safely honor guest PAT. > */ > - return shadow_memtype_mask; > + return kvm_check_has_quirk(kvm, KVM_X86_QUIRK_IGNORE_GUEST_PAT); This changes the original logic for shadow paging. But maybe it's benign as the point in [1]. [1] https://lore.kernel.org/all/Z8bbKCICpzBKyVBT@xxxxxxxxxxxxxxxxxxxxxxxxx/ > } > > int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index 486fbdb4365c..719e79712339 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -7599,6 +7599,17 @@ int vmx_vm_init(struct kvm *kvm) > return 0; > } > > +static inline bool vmx_ignore_guest_pat(struct kvm *kvm) > +{ > + /* > + * Non-coherent DMA devices need the guest to flush CPU properly. > + * In that case it is not possible to map all guest RAM as WB, so > + * always trust guest PAT. > + */ > + return !kvm_arch_has_noncoherent_dma(kvm) && > + kvm_check_has_quirk(kvm, KVM_X86_QUIRK_IGNORE_GUEST_PAT); > +} > + > u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) > { > /* > @@ -7608,13 +7619,8 @@ u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) > if (is_mmio) > return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT; > > - /* > - * Force WB and ignore guest PAT if the VM does NOT have a non-coherent > - * device attached. Letting the guest control memory types on Intel > - * CPUs may result in unexpected behavior, and so KVM's ABI is to trust > - * the guest to behave only as a last resort. > - */ > - if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) > + /* Force WB if ignoring guest PAT */ > + if (vmx_ignore_guest_pat(vcpu->kvm)) > return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; > > return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT); > @@ -8506,6 +8512,27 @@ __init int vmx_hardware_setup(void) > > kvm_set_posted_intr_wakeup_handler(pi_wakeup_handler); > > + /* > + * On Intel CPUs that lack self-snoop feature, letting the guest control > + * memory types may result in unexpected behavior. So always ignore guest > + * PAT on those CPUs and map VM as writeback, not allowing userspace to > + * disable the quirk. > + * > + * On certain Intel CPUs (e.g. SPR, ICX), though self-snoop feature is > + * supported, UC is slow enough to cause issues with some older guests (e.g. > + * an old version of bochs driver uses ioremap() instead of ioremap_wc() to > + * map the video RAM, causing wayland desktop to fail to get started > + * correctly). To avoid breaking those older guests that rely on KVM to force > + * memory type to WB, provide KVM_X86_QUIRK_IGNORE_GUEST_PAT to preserve the > + * safer (for performance) default behavior. > + * > + * On top of this, non-coherent DMA devices need the guest to flush CPU > + * caches properly. This also requires honoring guest PAT, and is forced > + * independent of the quirk in vmx_ignore_guest_pat(). > + */ > + if (!static_cpu_has(X86_FEATURE_SELFSNOOP)) What about if (!static_cpu_has(X86_FEATURE_SELFSNOOP) || !enable_ept) ? > + kvm_caps.supported_quirks &= ~KVM_X86_QUIRK_IGNORE_GUEST_PAT; > + kvm_caps.inapplicable_quirks &= ~KVM_X86_QUIRK_IGNORE_GUEST_PAT; > return r; > } > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 062c1b58b223..5b45fca3ddfa 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -13545,7 +13545,7 @@ static void kvm_noncoherent_dma_assignment_start_or_stop(struct kvm *kvm) > * (or last) non-coherent device is (un)registered to so that new SPTEs > * with the correct "ignore guest PAT" setting are created. > */ > - if (kvm_mmu_may_ignore_guest_pat()) > + if (kvm_mmu_may_ignore_guest_pat(kvm)) > kvm_zap_gfn_range(kvm, gpa_to_gfn(0), gpa_to_gfn(~0ULL)); > } > > -- > 2.43.5 > >