On Sat, Mar 01, 2025 at 02:34:27AM -0500, Paolo Bonzini wrote: > From: Yan Zhao <yan.y.zhao@xxxxxxxxx> > > Introduce an Intel specific quirk KVM_X86_QUIRK_EPT_IGNORE_GUEST_PAT to have > KVM ignore guest PAT when this quirk is enabled. > > KVM is able to safely honor guest PAT on Intel platforms when CPU feature > self-snoop is supported. However, KVM honoring guest PAT was reverted after > commit 9d70f3fec144 ("Revert "KVM: VMX: Always honor guest PAT on CPUs that > support self-snoop""), due to UC access on certain Intel platforms being > very slow [1]. Honoring guest PAT on those platforms may break some old > guests that accidentally specify PAT as UC. Those old guests may never > expect the slowness since KVM always forces WB previously. See [2]. > > So, introduce an Intel specific quirk KVM_X86_QUIRK_EPT_IGNORE_GUEST_PAT. > KVM enables the quirk on all Intel platforms by default to avoid breaking > old unmodifiable guests. Newer userspace can disable this quirk to turn on > honoring guest PAT. > > The quirk is only valid on Intel's platforms and is absent on AMD's > platforms as KVM always honors guest PAT when running on AMD. > > Suggested-by: Paolo Bonzini <pbonzini@xxxxxxxxxx> > Suggested-by: Sean Christopherson <seanjc@xxxxxxxxxx> > Cc: Kevin Tian <kevin.tian@xxxxxxxxx> > Signed-off-by: Yan Zhao <yan.y.zhao@xxxxxxxxx> > Link: https://lore.kernel.org/all/Ztl9NWCOupNfVaCA@xxxxxxxxxxxxxxxxxxxxxxxxx # [1] > Link: https://lore.kernel.org/all/87jzfutmfc.fsf@xxxxxxxxxx # [2] > Message-ID: <20250224070946.31482-1-yan.y.zhao@xxxxxxxxx> > Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx> > --- > Documentation/virt/kvm/api.rst | 22 +++++++++++++++++++ > arch/x86/include/uapi/asm/kvm.h | 1 + > arch/x86/kvm/mmu.h | 2 +- > arch/x86/kvm/mmu/mmu.c | 11 ++++++---- > arch/x86/kvm/svm/svm.c | 1 + > arch/x86/kvm/vmx/vmx.c | 39 +++++++++++++++++++++++++++------ > arch/x86/kvm/x86.c | 2 +- > 7 files changed, 65 insertions(+), 13 deletions(-) > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > index 2d75edc9db4f..1f13e47a65fa 100644 > --- a/Documentation/virt/kvm/api.rst > +++ b/Documentation/virt/kvm/api.rst > @@ -8157,6 +8157,28 @@ KVM_X86_QUIRK_STUFF_FEATURE_MSRS By default, at vCPU creation, KVM sets the > and 0x489), as KVM does now allow them to > be set by userspace (KVM sets them based on > guest CPUID, for safety purposes). > + > +KVM_X86_QUIRK_EPT_IGNORE_GUEST_PAT By default, on Intel platforms, KVM ignores > + guest PAT and forces the effective memory > + type to WB in EPT. The quirk is not available > + on Intel platforms which are incapable of > + safely honoring guest PAT (i.e., without CPU > + self-snoop, KVM always ignores guest PAT and > + forces effective memory type to WB). It is > + also ignored on AMD platforms or, on Intel, > + when a VM has non-coherent DMA devices > + assigned; KVM always honors guest PAT in > + such case. The quirk is needed to avoid > + slowdowns on certain Intel Xeon platforms > + (e.g. ICX, SPR) where self-snoop feature is > + supported but UC is slow enough to cause > + issues with some older guests that use > + UC instead of WC to map the video RAM. > + Userspace can disable the quirk to honor > + guest PAT if it knows that there is no such > + guest software, for example if it does not > + expose a bochs graphics device (which is > + known to have had a buggy driver). > =================================== ============================================ > > 7.32 KVM_CAP_MAX_VCPU_ID > diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h > index 89cc7a18ef45..db55a70e173c 100644 > --- a/arch/x86/include/uapi/asm/kvm.h > +++ b/arch/x86/include/uapi/asm/kvm.h > @@ -441,6 +441,7 @@ struct kvm_sync_regs { > #define KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS (1 << 6) > #define KVM_X86_QUIRK_SLOT_ZAP_ALL (1 << 7) > #define KVM_X86_QUIRK_STUFF_FEATURE_MSRS (1 << 8) > +#define KVM_X86_QUIRK_EPT_IGNORE_GUEST_PAT (1 << 9) > > #define KVM_STATE_NESTED_FORMAT_VMX 0 > #define KVM_STATE_NESTED_FORMAT_SVM 1 > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h > index 47e64a3c4ce3..f999c15d8d3e 100644 > --- a/arch/x86/kvm/mmu.h > +++ b/arch/x86/kvm/mmu.h > @@ -232,7 +232,7 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, > return -(u32)fault & errcode; > } > > -bool kvm_mmu_may_ignore_guest_pat(void); > +bool kvm_mmu_may_ignore_guest_pat(struct kvm *kvm); > > int kvm_mmu_post_init_vm(struct kvm *kvm); > void kvm_mmu_pre_destroy_vm(struct kvm *kvm); > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index e6eb3a262f8d..bcf395d7ec53 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -4663,17 +4663,20 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, > } > #endif > > -bool kvm_mmu_may_ignore_guest_pat(void) > +bool kvm_mmu_may_ignore_guest_pat(struct kvm *kvm) > { > /* > * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM > * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to > * honor the memtype from the guest's PAT so that guest accesses to > * memory that is DMA'd aren't cached against the guest's wishes. As a > - * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA, > - * KVM _always_ ignores guest PAT (when EPT is enabled). > + * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA. > + * KVM _always_ ignores guest PAT, when EPT is enabled and when quirk > + * KVM_X86_QUIRK_EPT_IGNORE_GUEST_PAT is enabled or the CPU lacks the > + * ability to safely honor guest PAT. > */ > - return shadow_memtype_mask; > + return shadow_memtype_mask && > + kvm_check_has_quirk(kvm, KVM_X86_QUIRK_EPT_IGNORE_GUEST_PAT); > } > > int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c > index ebaa5a41db07..2254dbebddac 100644 > --- a/arch/x86/kvm/svm/svm.c > +++ b/arch/x86/kvm/svm/svm.c > @@ -5426,6 +5426,7 @@ static __init int svm_hardware_setup(void) > */ > allow_smaller_maxphyaddr = !npt_enabled; > > + kvm_caps.inapplicable_quirks |= KVM_X86_QUIRK_EPT_IGNORE_GUEST_PAT; Similar to KVM_X86_QUIRK_CD_NW_CLEARED for Intel, KVM_X86_QUIRK_EPT_IGNORE_GUEST_PAT has actually no effect on AMD's platforms, even if kvm_check_has_quirk() returns true. I'm wondering if it's really necessary for us to introuce kvm_caps.inapplicable_quirks. > return 0; > > err: > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index 75df4caea2f7..5365efb22e96 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -7599,6 +7599,33 @@ int vmx_vm_init(struct kvm *kvm) > return 0; > } > > +/* > + * Ignore guest PAT when the CPU doesn't support self-snoop to safely honor > + * guest PAT, or quirk KVM_X86_QUIRK_EPT_IGNORE_GUEST_PAT is turned on. Always > + * honor guest PAT when there's non-coherent DMA device attached. > + * > + * Honoring guest PAT means letting the guest control memory types. > + * - On Intel CPUs that lack self-snoop feature, honoring guest PAT may result > + * in unexpected behavior. So always ignore guest PAT on those CPUs. > + * > + * - KVM's ABI is to trust the guest for attached non-coherent DMA devices to > + * function correctly (non-coherent DMA devices need the guest to flush CPU > + * caches properly). So honoring guest PAT to avoid breaking existing ABI. > + * > + * - On certain Intel CPUs (e.g. SPR, ICX), though self-snoop feature is > + * supported, UC is slow enough to cause issues with some older guests (e.g. > + * an old version of bochs driver uses ioremap() instead of ioremap_wc() to > + * map the video RAM, causing wayland desktop to fail to get started > + * correctly). To avoid breaking those old guests that rely on KVM to force > + * memory type to WB, only honoring guest PAT when quirk > + * KVM_X86_QUIRK_EPT_IGNORE_GUEST_PAT is disabled. > + */ > +static inline bool vmx_ignore_guest_pat(struct kvm *kvm) > +{ > + return !kvm_arch_has_noncoherent_dma(kvm) && > + kvm_check_has_quirk(kvm, KVM_X86_QUIRK_EPT_IGNORE_GUEST_PAT); > +} > + > u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) > { > /* > @@ -7608,13 +7635,8 @@ u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) > if (is_mmio) > return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT; > > - /* > - * Force WB and ignore guest PAT if the VM does NOT have a non-coherent > - * device attached. Letting the guest control memory types on Intel > - * CPUs may result in unexpected behavior, and so KVM's ABI is to trust > - * the guest to behave only as a last resort. > - */ > - if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) > + /* Force WB if ignoring guest PAT */ > + if (vmx_ignore_guest_pat(vcpu->kvm)) > return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; > > return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT); > @@ -8506,6 +8528,9 @@ __init int vmx_hardware_setup(void) > > kvm_set_posted_intr_wakeup_handler(pi_wakeup_handler); > > + /* Must use WB if the CPU does not have self-snoop. */ > + if (!static_cpu_has(X86_FEATURE_SELFSNOOP)) > + kvm_caps.supported_quirks &= ~KVM_X86_QUIRK_EPT_IGNORE_GUEST_PAT; > kvm_caps.inapplicable_quirks = KVM_X86_QUIRK_CD_NW_CLEARED; > return r; > } > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index a97e58916b6a..b221f273ec77 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -13544,7 +13544,7 @@ static void kvm_noncoherent_dma_assignment_start_or_stop(struct kvm *kvm) > * (or last) non-coherent device is (un)registered to so that new SPTEs > * with the correct "ignore guest PAT" setting are created. > */ > - if (kvm_mmu_may_ignore_guest_pat()) > + if (kvm_mmu_may_ignore_guest_pat(kvm)) > kvm_zap_gfn_range(kvm, gpa_to_gfn(0), gpa_to_gfn(~0ULL)); > } > > -- > 2.43.5 > >