On Tue, 2023-06-06 at 15:38 +0300, kirill.shutemov@xxxxxxxxxxxxxxx wrote:
> On Mon, Jun 05, 2023 at 02:27:17AM +1200, Kai Huang wrote:
> > TDX memory has integrity and confidentiality protections.  Violations of
> > this integrity protection are supposed to only affect TDX operations and
> > are never supposed to affect the host kernel itself.  In other words,
> > the host kernel should never, itself, see machine checks induced by the
> > TDX integrity hardware.
> >
> > Alas, the first few generations of TDX hardware have an erratum.  A
> > "partial" write to a TDX private memory cacheline will silently "poison"
> > the line.  Subsequent reads will consume the poison and generate a
> > machine check.  According to the TDX hardware spec, neither of these
> > things should have happened.
> >
> > Virtually all kernel memory accesses operations happen in full
> > cachelines.  In practice, writing a "byte" of memory usually reads a 64
> > byte cacheline of memory, modifies it, then writes the whole line back.
> > Those operations do not trigger this problem.
> >
> > This problem is triggered by "partial" writes where a write transaction
> > of less than cacheline lands at the memory controller.  The CPU does
> > these via non-temporal write instructions (like MOVNTI), or through
> > UC/WC memory mappings.  The issue can also be triggered away from the
> > CPU by devices doing partial writes via DMA.
> >
> > With this erratum, there are additional things need to be done around
> > machine check handler and kexec(), etc.  Similar to other CPU bugs, use
> > a CPU bug bit to indicate this erratum, and detect this erratum during
> > early boot.  Note this bug reflects the hardware thus it is detected
> > regardless of whether the kernel is built with TDX support or not.
> >
> > Signed-off-by: Kai Huang <kai.huang@xxxxxxxxx>
> > ---
> >
> > v10 -> v11:
> >  - New patch
> >
> > ---
> >  arch/x86/include/asm/cpufeatures.h |  1 +
> >  arch/x86/kernel/cpu/intel.c        | 21 +++++++++++++++++++++
> >  2 files changed, 22 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> > index cb8ca46213be..dc8701f8d88b 100644
> > --- a/arch/x86/include/asm/cpufeatures.h
> > +++ b/arch/x86/include/asm/cpufeatures.h
> > @@ -483,5 +483,6 @@
> >  #define X86_BUG_RETBLEED		X86_BUG(27) /* CPU is affected by RETBleed */
> >  #define X86_BUG_EIBRS_PBRSB		X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
> >  #define X86_BUG_SMT_RSB			X86_BUG(29) /* CPU is vulnerable to Cross-Thread Return Address Predictions */
> > +#define X86_BUG_TDX_PW_MCE		X86_BUG(30) /* CPU may incur #MC if non-TD software does partial write to TDX private memory */
> >
> >  #endif /* _ASM_X86_CPUFEATURES_H */
> > diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> > index 1c4639588ff9..251b333e53d2 100644
> > --- a/arch/x86/kernel/cpu/intel.c
> > +++ b/arch/x86/kernel/cpu/intel.c
> > @@ -1552,3 +1552,24 @@ u8 get_this_hybrid_cpu_type(void)
> >
> >  	return cpuid_eax(0x0000001a) >> X86_HYBRID_CPU_TYPE_ID_SHIFT;
> >  }
> > +
> > +/*
> > + * These CPUs have an erratum.  A partial write from non-TD
> > + * software (e.g. via MOVNTI variants or UC/WC mapping) to TDX
> > + * private memory poisons that memory, and a subsequent read of
> > + * that memory triggers #MC.
> > + */
> > +static const struct x86_cpu_id tdx_pw_mce_cpu_ids[] __initconst = {
> > +	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, NULL),
> > +	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, NULL),
> > +	{ }
> > +};
> > +
> > +static int __init tdx_erratum_detect(void)
> > +{
> > +	if (x86_match_cpu(tdx_pw_mce_cpu_ids))
> > +		setup_force_cpu_bug(X86_BUG_TDX_PW_MCE);
> > +
> > +	return 0;
> > +}
> > +early_initcall(tdx_erratum_detect);
>
> Initcall?
> Don't we already have a codepath to call it directly?
> Maybe cpu_set_bug_bits()?
>

I didn't like doing it in cpu_set_bug_bits() because the bugs handled in
that function seem to depend on one another.  For instance, if a CPU is in
the NO_SPECULATION whitelist, the function simply returns and assumes that
none of the other bugs are present:

static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
{
	u64 ia32_cap = x86_read_arch_cap_msr();

	...

	if (cpu_matches(cpu_vuln_whitelist, NO_SPECULATION))
		return;

	setup_force_cpu_bug(X86_BUG_SPECTRE_V1);

	...
}

This TDX erratum is quite self-contained, so I think an initcall is the
cleanest way to do it.

And there are other bug flags that are set in other places rather than in
cpu_set_bug_bits().  For instance:

static void init_intel(struct cpuinfo_x86 *c)
{
	...

	if (c->x86 == 6 && boot_cpu_has(X86_FEATURE_MWAIT) &&
		((c->x86_model == INTEL_FAM6_ATOM_GOLDMONT)))
		set_cpu_bug(c, X86_BUG_MONITOR);

	...
}

So it seems there's no hard rule that all bugs need to be set in
cpu_set_bug_bits().
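
To make the ordering concern concrete, here is a minimal userspace sketch
(not kernel code).  Every name in it (struct model_cpu, the MODEL_BUG_*
bits, the model_* helpers) is hypothetical and exists only for this
illustration, and a CPU that is both on a "no speculation bugs" whitelist
and on the erratum list is an assumed case, not a claim about real
hardware.  It only shows that a check placed after an early return
inherits that return's condition, while a standalone detector does not:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bug bits for this model; not the kernel's X86_BUG() values. */
enum model_bug {
	MODEL_BUG_SPECTRE_V1 = 0,
	MODEL_BUG_TDX_PW_MCE = 1,
	MODEL_BUG_COUNT
};

/* Hypothetical, heavily simplified stand-in for a CPU description. */
struct model_cpu {
	bool no_speculation;	/* assumed: on the speculation whitelist */
	bool tdx_pw_erratum;	/* assumed: on the partial-write erratum list */
	uint64_t bugs;		/* one bit per enum model_bug */
};

static void model_set_bug(struct model_cpu *c, enum model_bug bit)
{
	assert(bit < MODEL_BUG_COUNT);
	c->bugs |= UINT64_C(1) << bit;
}

static bool model_has_bug(const struct model_cpu *c, enum model_bug bit)
{
	return (c->bugs & (UINT64_C(1) << bit)) != 0;
}

/*
 * Variant A: the erratum check is appended to a shared function, after
 * the whitelist early return.  A whitelisted CPU never reaches it.
 */
static void model_set_bug_bits_shared(struct model_cpu *c)
{
	if (c->no_speculation)
		return;

	model_set_bug(c, MODEL_BUG_SPECTRE_V1);

	if (c->tdx_pw_erratum)
		model_set_bug(c, MODEL_BUG_TDX_PW_MCE);
}

/*
 * Variant B: a standalone detector, analogous in spirit to a separate
 * initcall.  It depends only on the erratum list.
 */
static void model_tdx_erratum_detect(struct model_cpu *c)
{
	if (c->tdx_pw_erratum)
		model_set_bug(c, MODEL_BUG_TDX_PW_MCE);
}
```

Calling model_set_bug_bits_shared() on a CPU with both flags set leaves
MODEL_BUG_TDX_PW_MCE clear, whereas model_tdx_erratum_detect() sets it.
The check could of course be placed before the early return instead; the
sketch is only meant to show the coupling, not to argue that it cannot be
avoided.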