On Tue, Oct 15, 2024, Kirill A. Shutemov wrote:
> AMD SEV-SNP and Intel TDX have limited access to MTRR: either it is not
> advertised in CPUID or it cannot be programmed (on TDX, due to #VE on
> CR0.CD clear).
>
> This results in guests using uncached mappings where it shouldn't and
> pmd/pud_set_huge() failures due to non-uniform memory type reported by
> mtrr_type_lookup().
>
> Override MTRR state, making it WB by default as the kernel does for
> Hyper-V guests.

In a turn of events that should have shocked no one, simply overriding the
default memtype also results in breakage.  The insanity that is ACPI relies
on UC MTRR mappings to force memory that is very obviously not RAM to use
non-WB mappings.  During acpi_init(), acpi_os_map_iomem() will attempt to
map devices as WB:

  static inline void __iomem *acpi_os_ioremap(acpi_physical_address phys,
                                              acpi_size size)
  {
          return ioremap_cache(phys, size);
  }

If acpi_init() runs before the corresponding device driver is probed, ACPI's
WB mapping will "win", and result in the driver's ioremap() failing because
the existing WB mapping isn't compatible with the requested UC-.

  [    1.730459] ioremap error for 0xfed40000-0xfed45000, requested 0x2, got 0x0
  [    1.732780] tpm_tis MSFT0101:00: probe with driver tpm_tis failed with error -12

Note, the '0x2' and '0x0' values refer to "enum page_cache_mode", not x86's
memtypes (which frustratingly are an almost pure inversion; 2 == WB, 0 == UC).

The above trace is from a Google-VMM based VM, but the same behavior happens
with a QEMU based VM, so unless QEMU is also building bad ACPI tables, I
don't think this is a VMM problem.  E.g. tracing mapping requests for HPET
under QEMU yields

  Mapping HPET, req_type = 0
  WARNING: CPU: 5 PID: 1 at arch/x86/mm/pat/memtype.c:528 memtype_reserve+0x22f/0x3f0
  Call Trace:
   <TASK>
   __ioremap_caller.constprop.0+0xd6/0x330
   acpi_os_map_iomem+0x195/0x1b0
   acpi_ex_system_memory_space_handler+0x11c/0x2f0
   acpi_ev_address_space_dispatch+0x168/0x3b0
   acpi_ex_access_region+0xd7/0x280
   acpi_ex_field_datum_io+0x73/0x210
   acpi_ex_extract_from_field+0x267/0x2a0
   acpi_ex_read_data_from_field+0x8e/0x220
   acpi_ex_resolve_node_to_value+0xe2/0x2b0
   acpi_ds_evaluate_name_path+0xa9/0x100
   acpi_ds_exec_end_op+0x21f/0x4c0
   acpi_ps_parse_loop+0xf4/0x670
   acpi_ps_parse_aml+0x17b/0x3d0
   acpi_ps_execute_method+0x137/0x260
   acpi_ns_evaluate+0x1f0/0x2d0
   acpi_evaluate_object+0x13d/0x2e0
   acpi_evaluate_integer+0x50/0xe0
   acpi_bus_get_status+0x7b/0x140
   acpi_add_single_object+0x3f8/0x750
   acpi_bus_check_add+0xb2/0x340
   acpi_ns_walk_namespace+0xfe/0x200
   acpi_walk_namespace+0xbb/0xe0
   acpi_bus_scan+0x1b5/0x1d0
   acpi_scan_init+0xd5/0x290
   acpi_init+0x1fc/0x520
   do_one_initcall+0x41/0x1d0
   kernel_init_freeable+0x164/0x260
   kernel_init+0x16/0x1a0
   ret_from_fork+0x2d/0x50
   ret_from_fork_asm+0x11/0x20
  ---[ end trace 0000000000000000 ]---

The only reason this doesn't cause problems for HPET is because HPET gets
special treatment via x86_init.timers.timer_init(), and so gets a chance to
create its UC- mapping before acpi_init() clobbers things.
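For reference, the special treatment amounts to hpet_time_init() (the default
x86_init.timers.timer_init hook) reaching hpet_set_mapping(), which does a
plain ioremap(), i.e. UC-.  Roughly (abridged from arch/x86/kernel/hpet.c from
memory, so treat the details as approximate):

  static void __init hpet_set_mapping(void)
  {
          /*
           * Plain ioremap() == UC-.  Because this runs from the early timer
           * init hook, HPET's uncached mapping is established long before
           * acpi_init() comes along and asks for WB.
           */
          hpet_virt_address = ioremap(hpet_address, HPET_MMAP_SIZE);
  }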
E.g. modifying the horrid CoCo MTRR hack to apply to all VM types, and then
disabling the early call to hpet_time_init() yields the same behavior for
HPET:

  [    0.318264] ioremap error for 0xfed00000-0xfed01000, requested 0x2, got 0x0

---
diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
index 7b29ebda024f..8b58024611e5 100644
--- a/arch/x86/kernel/cpu/mtrr/generic.c
+++ b/arch/x86/kernel/cpu/mtrr/generic.c
@@ -456,11 +456,13 @@ void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
 	 * - when running as SEV-SNP or TDX guest to avoid unnecessary
 	 *   VMM communication/Virtualization exceptions (#VC, #VE)
 	 */
+#if 0
 	if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP) &&
 	    !hv_is_isolation_supported() &&
 	    !cpu_feature_enabled(X86_FEATURE_XENPV) &&
 	    !cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
 		return;
+#endif
 
 	/* Disable MTRR in order to disable MTRR modifications. */
 	setup_clear_cpu_cap(X86_FEATURE_MTRR);
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 21e9e4845354..11f08aba1b8c 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -982,6 +982,8 @@ static void __init kvm_init_platform(void)
 	kvmclock_init();
 	x86_platform.apic_post_init = kvm_apic_init;
 
+	x86_init.timers.timer_init = x86_init_noop;
+
 	/* Set WB as the default cache mode for SEV-SNP and TDX */
 	mtrr_overwrite_state(NULL, 0, MTRR_TYPE_WRBACK);
 }
---

This is all honestly beyond ridiculous.  As of commit 0a7b73559b39 ("KVM:
x86: Remove VMX support for virtualizing guest MTRR memtypes"), MTRRs are
never virtualized under KVM, i.e. the memtype in the MTRR has *zero* impact
on the effective memtype.  And even before that commit, KVM only virtualized
MTRRs for VMs running on Intel hosts with passthrough non-coherent DMA
devices.

Even worse, the regions that are typically covered by the default MTRR are
either host MMIO or emulated MMIO.  For host MMIO, KVM darn well needs to
ensure that memory is UC.  And for emulated MMIO, even if KVM did virtualize
MTRRs, there would *still* be zero impact on the effective memtype.

In other words, irrespective of SNP and TDX, programming the MTRRs under KVM
does nothing more than give the kernel warm fuzzies.  But the _software_
tracking of what the kernel thinks are the requisite memtypes obviously
matters.

As a stopgap, rather than carry this CoCo hack, what if we add a synthetic
feature flag that says it's ok to modify MTRRs without disabling caching?  I
think that'll make TDX happy, and should avoid a long game of whack-a-mole.
Then longterm, figure out a clean way to eliminate accessing the "real"
MTRRs entirely.

This is safe even under older KVM, because KVM obviously isn't writing the
real MTRRs, and KVM invalidates all affected EPT entries (where KVM shoves
the guest's desired memtype) on an MTRR update.  If that's a concern we
could do this only for "special" guests, and/or add a PV CPUID flag to KVM
to announce that MTRRs are only emulated.  E.g.
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 645aa360628d..b5699eeaef5d 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -236,6 +236,7 @@
 #define X86_FEATURE_PVUNLOCK		( 8*32+20) /* PV unlock function */
 #define X86_FEATURE_VCPUPREEMPT		( 8*32+21) /* PV vcpu_is_preempted function */
 #define X86_FEATURE_TDX_GUEST		( 8*32+22) /* "tdx_guest" Intel Trust Domain Extensions Guest */
+#define X86_FEATURE_EMULATED_MTRRS	( 8*32+23) /* MTRRs are emulated and can be modified while caching is enabled */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
 #define X86_FEATURE_FSGSBASE		( 9*32+ 0) /* "fsgsbase" RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
diff --git a/arch/x86/kernel/cpu/cacheinfo.c b/arch/x86/kernel/cpu/cacheinfo.c
index e6fa03ed9172..c668032d1dc1 100644
--- a/arch/x86/kernel/cpu/cacheinfo.c
+++ b/arch/x86/kernel/cpu/cacheinfo.c
@@ -1053,6 +1053,9 @@ void cache_disable(void) __acquires(cache_disable_lock)
 {
 	unsigned long cr0;
 
+	if (cpu_feature_enabled(X86_FEATURE_EMULATED_MTRRS))
+		return;
+
 	/*
 	 * Note that this is not ideal
 	 * since the cache is only flushed/disabled for this CPU while the
@@ -1095,6 +1098,9 @@ void cache_disable(void) __acquires(cache_disable_lock)
 
 void cache_enable(void) __releases(cache_disable_lock)
 {
+	if (cpu_feature_enabled(X86_FEATURE_EMULATED_MTRRS))
+		return;
+
 	/* Flush TLBs (no need to flush caches - they are disabled) */
 	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
 	flush_tlb_local();
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 21e9e4845354..6266b132e359 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -982,8 +982,7 @@ static void __init kvm_init_platform(void)
 	kvmclock_init();
 	x86_platform.apic_post_init = kvm_apic_init;
 
-	/* Set WB as the default cache mode for SEV-SNP and TDX */
-	mtrr_overwrite_state(NULL, 0, MTRR_TYPE_WRBACK);
+	setup_force_cpu_cap(X86_FEATURE_EMULATED_MTRRS);
 }
 
 #if defined(CONFIG_AMD_MEM_ENCRYPT)
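And if we do want the PV CPUID flag so that only hypervisors that opt in get
this behavior, the guest side could be as simple as the below.  Rough sketch
only; KVM_FEATURE_EMULATED_MTRRS and its bit number are made up here and would
need to be allocated in arch/x86/include/uapi/asm/kvm_para.h and advertised by
KVM in KVM_CPUID_FEATURES:

  /* Hypothetical KVM_FEATURE_* bit, number purely for illustration. */
  #define KVM_FEATURE_EMULATED_MTRRS	18

  /*
   * In kvm_init_platform(): only treat MTRRs as purely emulated, i.e.
   * writable without the cache disable/enable dance, when the hypervisor
   * explicitly says so, instead of assuming it for every KVM guest.
   */
  if (kvm_para_has_feature(KVM_FEATURE_EMULATED_MTRRS))
  	setup_force_cpu_cap(X86_FEATURE_EMULATED_MTRRS);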