On Tue, Oct 15, 2024, Kirill A. Shutemov wrote:
> AMD SEV-SNP and Intel TDX have limited access to MTRR: either it is not
> advertised in CPUID or it cannot be programmed (on TDX, due to #VE on
> CR0.CD clear).
>
> This results in guests using uncached mappings where it shouldn't and
> pmd/pud_set_huge() failures due to non-uniform memory type reported by
> mtrr_type_lookup().
>
> Override MTRR state, making it WB by default as the kernel does for
> Hyper-V guests.

In a turn of events that should have shocked no one, simply overriding the
default memtype also results in breakage.  The insanity that is ACPI relies
on UC MTRR mappings to force memory that is very obviously not RAM to use
non-WB mappings.  During acpi_init(), acpi_os_map_iomem() will attempt to
map devices as WB:

  static inline void __iomem *acpi_os_ioremap(acpi_physical_address phys,
                                              acpi_size size)
  {
          return ioremap_cache(phys, size);
  }

If acpi_init() runs before the corresponding device driver is probed, ACPI's
WB mapping will "win", and result in the driver's ioremap() failing because
the existing WB mapping isn't compatible with the requested UC-.

  [    1.730459] ioremap error for 0xfed40000-0xfed45000, requested 0x2, got 0x0
  [    1.732780] tpm_tis MSFT0101:00: probe with driver tpm_tis failed with error -12

Note, the '0x2' and '0x0' values refer to "enum page_cache_mode", not x86's
memtypes (which frustratingly are an almost pure inversion; 2 == WB, 0 == UC).

The above trace is from a Google-VMM based VM, but the same behavior happens
with a QEMU based VM, so unless QEMU is also building bad ACPI tables, I
don't think this is a VMM problem.  E.g. tracing mapping requests for HPET
under QEMU yields

  Mapping HPET, req_type = 0
  WARNING: CPU: 5 PID: 1 at arch/x86/mm/pat/memtype.c:528 memtype_reserve+0x22f/0x3f0
  Call Trace:
   <TASK>
   __ioremap_caller.constprop.0+0xd6/0x330
   acpi_os_map_iomem+0x195/0x1b0
   acpi_ex_system_memory_space_handler+0x11c/0x2f0
   acpi_ev_address_space_dispatch+0x168/0x3b0
   acpi_ex_access_region+0xd7/0x280
   acpi_ex_field_datum_io+0x73/0x210
   acpi_ex_extract_from_field+0x267/0x2a0
   acpi_ex_read_data_from_field+0x8e/0x220
   acpi_ex_resolve_node_to_value+0xe2/0x2b0
   acpi_ds_evaluate_name_path+0xa9/0x100
   acpi_ds_exec_end_op+0x21f/0x4c0
   acpi_ps_parse_loop+0xf4/0x670
   acpi_ps_parse_aml+0x17b/0x3d0
   acpi_ps_execute_method+0x137/0x260
   acpi_ns_evaluate+0x1f0/0x2d0
   acpi_evaluate_object+0x13d/0x2e0
   acpi_evaluate_integer+0x50/0xe0
   acpi_bus_get_status+0x7b/0x140
   acpi_add_single_object+0x3f8/0x750
   acpi_bus_check_add+0xb2/0x340
   acpi_ns_walk_namespace+0xfe/0x200
   acpi_walk_namespace+0xbb/0xe0
   acpi_bus_scan+0x1b5/0x1d0
   acpi_scan_init+0xd5/0x290
   acpi_init+0x1fc/0x520
   do_one_initcall+0x41/0x1d0
   kernel_init_freeable+0x164/0x260
   kernel_init+0x16/0x1a0
   ret_from_fork+0x2d/0x50
   ret_from_fork_asm+0x11/0x20
  ---[ end trace 0000000000000000 ]---

The only reason this doesn't cause problems for HPET is because HPET gets
special treatment via x86_init.timers.timer_init(), and so gets a chance to
create its UC- mapping before acpi_init() clobbers things.
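For reference, the special treatment amounts to hpet_time_init() (the default
x86_init.timers.timer_init hook) reaching hpet_set_mapping(), which does a
plain ioremap(), i.e. UC-.  Roughly (abridged from arch/x86/kernel/hpet.c from
memory, so treat the details as approximate):

  static void __init hpet_set_mapping(void)
  {
          /*
           * Plain ioremap() == UC-.  Because this runs from the early timer
           * init hook, HPET's uncached mapping is established long before
           * acpi_init() comes along and asks for WB.
           */
          hpet_virt_address = ioremap(hpet_address, HPET_MMAP_SIZE);
  }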
E.g. modifying the horrid CoCo MTRR hack to apply to all VM types, and then
disabling the early call to hpet_time_init() yields the same behavior for
HPET:

  [    0.318264] ioremap error for 0xfed00000-0xfed01000, requested 0x2, got 0x0

---
diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
index 7b29ebda024f..8b58024611e5 100644
--- a/arch/x86/kernel/cpu/mtrr/generic.c
+++ b/arch/x86/kernel/cpu/mtrr/generic.c
@@ -456,11 +456,13 @@ void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
 	 * - when running as SEV-SNP or TDX guest to avoid unnecessary
 	 *   VMM communication/Virtualization exceptions (#VC, #VE)
 	 */
+#if 0
 	if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP) &&
 	    !hv_is_isolation_supported() &&
 	    !cpu_feature_enabled(X86_FEATURE_XENPV) &&
 	    !cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
 		return;
+#endif
 
 	/* Disable MTRR in order to disable MTRR modifications. */
 	setup_clear_cpu_cap(X86_FEATURE_MTRR);
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 21e9e4845354..11f08aba1b8c 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -982,6 +982,8 @@ static void __init kvm_init_platform(void)
 	kvmclock_init();
 	x86_platform.apic_post_init = kvm_apic_init;
 
+	x86_init.timers.timer_init = x86_init_noop;
+
 	/* Set WB as the default cache mode for SEV-SNP and TDX */
 	mtrr_overwrite_state(NULL, 0, MTRR_TYPE_WRBACK);
 }
---

This is all honestly beyond ridiculous.  As of commit 0a7b73559b39 ("KVM:
x86: Remove VMX support for virtualizing guest MTRR memtypes"), MTRRs are
never virtualized under KVM, i.e. the memtype in the MTRR has *zero* impact
on the effective memtype.  And even before that commit, KVM only virtualized
MTRRs for VMs running on Intel hosts with passthrough non-coherent DMA
devices.

Even worse, the regions that are typically covered by the default MTRR are
either host MMIO or emulated MMIO.  For host MMIO, KVM darn well needs to
ensure that memory is UC.  And for emulated MMIO, even if KVM did virtualize
MTRRs, there would *still* be zero impact on the effective memtype.

In other words, irrespective of SNP and TDX, programming the MTRRs under KVM
does nothing more than give the kernel warm fuzzies.  But the _software_
tracking of what the kernel thinks are the requisite memtypes obviously
matters.

As a stopgap, rather than carry this CoCo hack, what if we add a synthetic
feature flag that says it's ok to modify MTRRs without disabling caching?  I
think that'll make TDX happy, and should avoid a long game of whack-a-mole.
Then longterm, figure out a clean way to eliminate accessing the "real"
MTRRs entirely.

This is safe even under older KVM, because KVM obviously isn't writing the
real MTRRs, and KVM invalidates all affected EPT entries (where KVM shoves
the guest's desired memtype) on an MTRR update.  If that's a concern we
could do this only for "special" guests, and/or add a PV CPUID flag to KVM
to announce that MTRRs are only emulated.  E.g.
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 645aa360628d..b5699eeaef5d 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -236,6 +236,7 @@
 #define X86_FEATURE_PVUNLOCK		( 8*32+20) /* PV unlock function */
 #define X86_FEATURE_VCPUPREEMPT		( 8*32+21) /* PV vcpu_is_preempted function */
 #define X86_FEATURE_TDX_GUEST		( 8*32+22) /* "tdx_guest" Intel Trust Domain Extensions Guest */
+#define X86_FEATURE_EMULATED_MTRRS	( 8*32+23) /* MTRRs are emulated and can be modified while caching is enabled */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
 #define X86_FEATURE_FSGSBASE		( 9*32+ 0) /* "fsgsbase" RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
diff --git a/arch/x86/kernel/cpu/cacheinfo.c b/arch/x86/kernel/cpu/cacheinfo.c
index e6fa03ed9172..c668032d1dc1 100644
--- a/arch/x86/kernel/cpu/cacheinfo.c
+++ b/arch/x86/kernel/cpu/cacheinfo.c
@@ -1053,6 +1053,9 @@ void cache_disable(void) __acquires(cache_disable_lock)
 {
 	unsigned long cr0;
 
+	if (cpu_feature_enabled(X86_FEATURE_EMULATED_MTRRS))
+		return;
+
 	/*
 	 * Note that this is not ideal
 	 * since the cache is only flushed/disabled for this CPU while the
@@ -1095,6 +1098,9 @@ void cache_disable(void) __acquires(cache_disable_lock)
 
 void cache_enable(void) __releases(cache_disable_lock)
 {
+	if (cpu_feature_enabled(X86_FEATURE_EMULATED_MTRRS))
+		return;
+
 	/* Flush TLBs (no need to flush caches - they are disabled) */
 	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
 	flush_tlb_local();
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 21e9e4845354..6266b132e359 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -982,8 +982,7 @@ static void __init kvm_init_platform(void)
 	kvmclock_init();
 	x86_platform.apic_post_init = kvm_apic_init;
 
-	/* Set WB as the default cache mode for SEV-SNP and TDX */
-	mtrr_overwrite_state(NULL, 0, MTRR_TYPE_WRBACK);
+	setup_force_cpu_cap(X86_FEATURE_EMULATED_MTRRS);
 }
 
 #if defined(CONFIG_AMD_MEM_ENCRYPT)
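And if we do want the PV CPUID flag so that only hypervisors that opt in get
this behavior, the guest side could be as simple as the below.  Rough sketch
only; KVM_FEATURE_EMULATED_MTRRS and its bit number are made up here and would
need to be allocated in arch/x86/include/uapi/asm/kvm_para.h and advertised by
KVM in KVM_CPUID_FEATURES:

  /* Hypothetical KVM_FEATURE_* bit, number purely for illustration. */
  #define KVM_FEATURE_EMULATED_MTRRS	18

  /*
   * In kvm_init_platform(): only treat MTRRs as purely emulated, i.e.
   * writable without the cache disable/enable dance, when the hypervisor
   * explicitly says so, instead of assuming it for every KVM guest.
   */
  if (kvm_para_has_feature(KVM_FEATURE_EMULATED_MTRRS))
  	setup_force_cpu_cap(X86_FEATURE_EMULATED_MTRRS);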