[PATCH 2/2] x86/kvm: Override low memory above TOLUD to WB when MTRRs are forced WB

Sean Christopherson <seanjc@xxxxxxxxxx> · Fri, 31 Jan 2025 16:50:48 -0800

When running as an SNP or TDX guest under KVM, treat the legacy PCI hole,
i.e. memory between Top of Lower Usable DRAM and 4GiB, as an untracked PAT
range to workaround issues with mapping legacy devices when MTRRs are
forced to WB.

In most KVM-based setups, legacy devices such as the HPET and TPM are
enumerated via ACPI.  For unknown reasons, ACPI auto-maps such devices as
WB, whereas the dedicated device drivers map memory as WC or UC.  In normal
setups, the entire mess "works" as firmware configures the PCI hole (and
other device memory) to be UC in the MTRRs.  As a result, the ACPI mappings
end up UC, which is compatible with the drivers' requested WC/UC-.

With WB MTRRs, the ACPI mappings get their requested WB.  If acpi_init()
runs before the corresponding device driver is probed, ACPI's WB mapping
will "win", and result in the driver's ioremap() failing because the
existing WB mapping isn't compatible with the requested WC/UC-.

E.g. when a TPM is emulated by the hypervisor (ignoring the security
implications of relying on what is allegedly an untrusted entity to store
measurements), the TPM driver will request UC and fail:

  [  1.730459] ioremap error for 0xfed40000-0xfed45000, requested 0x2, got 0x0
  [  1.732780] tpm_tis MSFT0101:00: probe with driver tpm_tis failed with error -12

Note, the '0x2' and '0x0' values refer to "enum page_cache_mode", not x86's
memtypes (which frustratingly are an almost pure inversion; 2 == WB, 0 == UC).

The above trace is from a Google-VMM based VM, but the same behavior
happens with a QEMU based VM.  E.g. tracing mapping requests for HPET under
QEMU yields:

   Mapping HPET, req_type = 0
   WARNING: CPU: 5 PID: 1 at arch/x86/mm/pat/memtype.c:528 memtype_reserve+0x22f/0x3f0
   Call Trace:
    <TASK>
    __ioremap_caller.constprop.0+0xd6/0x330
    acpi_os_map_iomem+0x195/0x1b0
    acpi_ex_system_memory_space_handler+0x11c/0x2f0
    acpi_ev_address_space_dispatch+0x168/0x3b0
    acpi_ex_access_region+0xd7/0x280
    acpi_ex_field_datum_io+0x73/0x210
    acpi_ex_extract_from_field+0x267/0x2a0
    acpi_ex_read_data_from_field+0x8e/0x220
    acpi_ex_resolve_node_to_value+0xe2/0x2b0
    acpi_ds_evaluate_name_path+0xa9/0x100
    acpi_ds_exec_end_op+0x21f/0x4c0
    acpi_ps_parse_loop+0xf4/0x670
    acpi_ps_parse_aml+0x17b/0x3d0
    acpi_ps_execute_method+0x137/0x260
    acpi_ns_evaluate+0x1f0/0x2d0
    acpi_evaluate_object+0x13d/0x2e0
    acpi_evaluate_integer+0x50/0xe0
    acpi_bus_get_status+0x7b/0x140
    acpi_add_single_object+0x3f8/0x750
    acpi_bus_check_add+0xb2/0x340
    acpi_ns_walk_namespace+0xfe/0x200
    acpi_walk_namespace+0xbb/0xe0
    acpi_bus_scan+0x1b5/0x1d0
    acpi_scan_init+0xd5/0x290
    acpi_init+0x1fc/0x520
    do_one_initcall+0x41/0x1d0
    kernel_init_freeable+0x164/0x260
    kernel_init+0x16/0x1a0
    ret_from_fork+0x2d/0x50
    ret_from_fork_asm+0x11/0x20
   ---[ end trace 0000000000000000 ]---

The only reason this doesn't cause problems for HPET is because HPET gets
special treatment via x86_init.timers.timer_init(), and so gets a chance
to create its UC- mapping before acpi_init() clobbers things.  Disabling
the early call to hpet_time_init() yields the same behavior for HPET:

  [  0.318264] ioremap error for 0xfed00000-0xfed01000, requested 0x2, got 0x0

Hack around the mess by forcing such mappings to WB, as the memory type is
irrevelant.  Even in a theoretical setup where such devices are passed
through by the host, i.e. point at real MMIO memory, it is KVM's (as the
hypervisor) responsibility to force the memory to be WC/UC, e.g. via EPT
memtype under TDX or real hardware MTRRs under SNP.  Not doing so cannot
work, and the hypervisor is highly motivated to do the right thing as
letting the guest access hardware MMIO with WB would likely result in a
variety of fatal #MCs.

Limit the hack to the legacy PCI hole on the off chance that there are
use cases that want to map virtual devices with WC/UC.  E.g. in theory, it
would be possible to expose hardware GPU buffers to an SNP or TDX guest.
Extending the hack, e.g. if there are use cases for memory above 4GiB that
are affected by ACPI, is far easier than debugging memory corruption if a
driver requests WC/UC and silently gets WB.

Double down on forcing everything to WB, e.g. instead of fixing the CR0.CD
issue and reverting to a "normal" model, as OVMF has also been taught to
ignore MTRRs when running as a TDX guest:

  3a3b12cbda ("UefiCpuPkg/MtrrLib: MtrrLibIsMtrrSupported always return FALSE in TD-Guest")
  071d2cfab8 ("OvmfPkg/Sec: Skip setup MTRR early in TD-Guest")

And running with firmware that doesn't program MTRRs would likely put the
kernel back into the conundrum of ACPI mapping devices WB, with drivers
wanting WC/UC-.

Fixes: 8e690b817e38 ("x86/kvm: Override default caching mode for SEV-SNP and TDX")
Cc: stable@xxxxxxxxxxxxxxx
Cc: Dionna Glaze <dionnaglaze@xxxxxxxxxx>
Cc: Peter Gonda <pgonda@xxxxxxxxxx>
Cc: Jürgen Groß <jgross@xxxxxxxx>
Cc: Kirill Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
Cc: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Borislav Petkov <bp@xxxxxxxxx>
Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
Cc: H. Peter Anvin <hpa@xxxxxxxxx>
Cc: Binbin Wu <binbin.wu@xxxxxxxxx>
Cc: Tom Lendacky <thomas.lendacky@xxxxxxx>
Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
---
 arch/x86/kernel/kvm.c | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 7a422a6c5983..7ae294fe99c3 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -931,6 +931,23 @@ static void kvm_sev_hc_page_enc_status(unsigned long pfn, int npages, bool enc)
 			   KVM_MAP_GPA_RANGE_ENC_STAT(enc) | KVM_MAP_GPA_RANGE_PAGE_SZ_4K);
 }
 
+static u64 kvm_tolud __ro_after_init;
+
+static bool kvm_is_forced_wb_range(u64 start, u64 end)
+{
+	/*
+	 * In addition to the standard ISA override, force all low memory above
+	 * TOLUD to WB so that legacy devices are mapped with WB when running
+	 * as an SNP or TDX guest.  The memtype itself is completely irrevelant
+	 * as the devices are emulated, the override^Whack is needed purely to
+	 * avoid failures due to ACPI mapping device memory as WB in advance of
+	 * device drivers requesting WC or UC.  In a system with MTRRs, ACPI's
+	 * mappings get forced to UC via MTRRs (programmed sanely by firmware).
+	 */
+	return is_ISA_range(start, end) ||
+	       (start >= kvm_tolud && end <= SZ_4G);
+}
+
 static void __init kvm_init_platform(void)
 {
 	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) &&
@@ -982,8 +999,18 @@ static void __init kvm_init_platform(void)
 	kvmclock_init();
 	x86_platform.apic_post_init = kvm_apic_init;
 
-	/* Set WB as the default cache mode for SEV-SNP and TDX */
-	guest_force_mtrr_state(NULL, 0, MTRR_TYPE_WRBACK);
+	/*
+	 * Set WB as the default cache mode for SEV-SNP and TDX.  MTRRs may be
+	 * enumerated as supported, but neither the TDX-Module (Secure EPT) nor
+	 * KVM (normal EPT for TDX, virtual MTRRs for NPT) actually virtualizes
+	 * MTRR memory types.  If MTRRs are forced to writeback, register KVM's
+	 * range-based WB override to handle cases where device drivers try to
+	 * map an emulated device's memory as WC, and fail because it's all WB.
+	 */
+	if (guest_force_mtrr_state(NULL, 0, MTRR_TYPE_WRBACK)) {
+		kvm_tolud = (e820__end_of_low_ram_pfn() << PAGE_SHIFT);
+		x86_platform.is_untracked_pat_range = kvm_is_forced_wb_range;
+	}
 }
 
 #if defined(CONFIG_AMD_MEM_ENCRYPT)
-- 
2.48.1.362.g079036d154-goog