Re: [PATCH v11 04/20] x86/cpu: Detect TDX partial write machine check erratum

"Huang, Kai" <kai.huang@xxxxxxxxx> · Tue, 6 Jun 2023 22:58:04 +0000

On Tue, 2023-06-06 at 15:38 +0300, kirill.shutemov@xxxxxxxxxxxxxxx wrote:
> On Mon, Jun 05, 2023 at 02:27:17AM +1200, Kai Huang wrote:
> > TDX memory has integrity and confidentiality protections.  Violations of
> > this integrity protection are supposed to only affect TDX operations and
> > are never supposed to affect the host kernel itself.  In other words,
> > the host kernel should never, itself, see machine checks induced by the
> > TDX integrity hardware.
> > 
> > Alas, the first few generations of TDX hardware have an erratum.  A
> > "partial" write to a TDX private memory cacheline will silently "poison"
> > the line.  Subsequent reads will consume the poison and generate a
> > machine check.  According to the TDX hardware spec, neither of these
> > things should have happened.
> > 
> > Virtually all kernel memory access operations happen in full
> > cachelines.  In practice, writing a "byte" of memory usually reads a 64
> > byte cacheline of memory, modifies it, then writes the whole line back.
> > Those operations do not trigger this problem.
> > 
> > This problem is triggered by "partial" writes where a write transaction
> > of less than a cacheline lands at the memory controller.  The CPU does
> > these via non-temporal write instructions (like MOVNTI), or through
> > UC/WC memory mappings.  The issue can also be triggered away from the
> > CPU by devices doing partial writes via DMA.
> > 
> > With this erratum, additional things need to be done around the
> > machine check handler and kexec(), etc.  Similar to other CPU bugs, use
> > a CPU bug bit to indicate this erratum, and detect this erratum during
> > early boot.  Note this bug reflects the hardware thus it is detected
> > regardless of whether the kernel is built with TDX support or not.
> > 
> > Signed-off-by: Kai Huang <kai.huang@xxxxxxxxx>
> > ---
> > 
> > v10 -> v11:
> >  - New patch
> > 
> > ---
> >  arch/x86/include/asm/cpufeatures.h |  1 +
> >  arch/x86/kernel/cpu/intel.c        | 21 +++++++++++++++++++++
> >  2 files changed, 22 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> > index cb8ca46213be..dc8701f8d88b 100644
> > --- a/arch/x86/include/asm/cpufeatures.h
> > +++ b/arch/x86/include/asm/cpufeatures.h
> > @@ -483,5 +483,6 @@
> >  #define X86_BUG_RETBLEED		X86_BUG(27) /* CPU is affected by RETBleed */
> >  #define X86_BUG_EIBRS_PBRSB		X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
> >  #define X86_BUG_SMT_RSB			X86_BUG(29) /* CPU is vulnerable to Cross-Thread Return Address Predictions */
> > +#define X86_BUG_TDX_PW_MCE		X86_BUG(30) /* CPU may incur #MC if non-TD software does partial write to TDX private memory */
> >  
> >  #endif /* _ASM_X86_CPUFEATURES_H */
> > diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> > index 1c4639588ff9..251b333e53d2 100644
> > --- a/arch/x86/kernel/cpu/intel.c
> > +++ b/arch/x86/kernel/cpu/intel.c
> > @@ -1552,3 +1552,24 @@ u8 get_this_hybrid_cpu_type(void)
> >  
> >  	return cpuid_eax(0x0000001a) >> X86_HYBRID_CPU_TYPE_ID_SHIFT;
> >  }
> > +
> > +/*
> > + * These CPUs have an erratum.  A partial write from non-TD
> > + * software (e.g. via MOVNTI variants or UC/WC mapping) to TDX
> > + * private memory poisons that memory, and a subsequent read of
> > + * that memory triggers #MC.
> > + */
> > +static const struct x86_cpu_id tdx_pw_mce_cpu_ids[] __initconst = {
> > +	X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, NULL),
> > +	X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, NULL),
> > +	{ }
> > +};
> > +
> > +static int __init tdx_erratum_detect(void)
> > +{
> > +	if (x86_match_cpu(tdx_pw_mce_cpu_ids))
> > +		setup_force_cpu_bug(X86_BUG_TDX_PW_MCE);
> > +
> > +	return 0;
> > +}
> > +early_initcall(tdx_erratum_detect);
> 
> Initcall? Don't we already have a codepath to call it directly?
> Maybe cpu_set_bug_bits()?
> 
I didn't like doing it in cpu_set_bug_bits() because the bugs handled in that
function appear to depend on one another.  For instance, if a CPU is in the
NO_SPECULATION whitelist, the function simply returns and assumes all the
other bugs are not present:

static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
{
	u64 ia32_cap = x86_read_arch_cap_msr();

	...

	if (cpu_matches(cpu_vuln_whitelist, NO_SPECULATION))
		return;

	setup_force_cpu_bug(X86_BUG_SPECTRE_V1);

	...
}

This TDX erratum is quite self-contained, so I think using an initcall is the
cleanest way to do it.
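
To make the concern concrete, here is a minimal user-space sketch (not kernel
code).  It models a combined bug-bits function with an early return for
whitelisted CPUs next to a separate, self-contained detection step.  Every
name in it (struct cpu_sketch, BUG_*_SKETCH, detect_sketch() and so on) is
hypothetical and only stands in for the kernel's real helpers:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for bug bits; not the kernel's X86_BUG_* values. */
enum {
	BUG_SPECTRE_V1_SKETCH = 1u << 0,
	BUG_TDX_PW_MCE_SKETCH = 1u << 1,
};

struct cpu_sketch {
	bool no_speculation;	/* CPU is on the "no speculation" whitelist */
	bool has_tdx_erratum;	/* CPU model has the partial-write erratum */
	unsigned int bugs;	/* accumulated bug bits */
};

/* Combined style: the erratum check sits after the whitelist early return. */
static void set_bug_bits_combined(struct cpu_sketch *c)
{
	if (c->no_speculation)
		return;		/* everything below is skipped */

	c->bugs |= BUG_SPECTRE_V1_SKETCH;

	if (c->has_tdx_erratum)
		c->bugs |= BUG_TDX_PW_MCE_SKETCH;
}

/* Separate style: detection does not depend on any other check. */
static void tdx_erratum_detect_sketch(struct cpu_sketch *c)
{
	if (c->has_tdx_erratum)
		c->bugs |= BUG_TDX_PW_MCE_SKETCH;
}

/* Build a CPU description, run the step(s), and return the bug mask. */
unsigned int detect_sketch(bool no_spec, bool erratum, bool separate)
{
	struct cpu_sketch c = {
		.no_speculation = no_spec,
		.has_tdx_erratum = erratum,
	};

	set_bug_bits_combined(&c);
	if (separate)
		tdx_erratum_detect_sketch(&c);

	return c.bugs;
}

/* Expected behaviour of the sketch, expressed as assertions. */
void demo_checks(void)
{
	/* Whitelisted CPU: the combined function alone misses the erratum. */
	assert(detect_sketch(true, true, false) == 0);
	/* The separate step still reports it. */
	assert(detect_sketch(true, true, true) == BUG_TDX_PW_MCE_SKETCH);
}
```

In this sketch the erratum bit is lost for a whitelisted CPU only because of
where the check was placed, which is the kind of coupling a standalone
detection function avoids.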

There are also other bug flags that are set elsewhere rather than in
cpu_set_bug_bits(), for instance:

static void init_intel(struct cpuinfo_x86 *c)
{
	...

	if (c->x86 == 6 && boot_cpu_has(X86_FEATURE_MWAIT) &&
		((c->x86_model == INTEL_FAM6_ATOM_GOLDMONT)))
		set_cpu_bug(c, X86_BUG_MONITOR);

	...
}

So it seems there's no hard rule that all bugs need to be set in
cpu_set_bug_bits().
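
For reference, the table-driven check in the patch can be approximated in
user space as below.  This is only a sketch of the sentinel-terminated lookup
idea, not the kernel's x86_match_cpu() implementation.  The struct and
function names are hypothetical, and the model numbers (0x8f for Sapphire
Rapids X, 0xcf for Emerald Rapids X) are assumptions written from memory of
intel-family.h:

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct x86_cpu_id. */
struct cpu_id_sketch {
	unsigned int family;
	unsigned int model;
};

/* Affected models; the all-zero entry terminates the table. */
static const struct cpu_id_sketch tdx_pw_mce_ids_sketch[] = {
	{ 6, 0x8f },	/* Sapphire Rapids X (assumed model number) */
	{ 6, 0xcf },	/* Emerald Rapids X (assumed model number) */
	{ 0, 0 },	/* sentinel */
};

/* Walk the table until the sentinel; true if family and model both match. */
bool match_cpu_sketch(const struct cpu_id_sketch *tbl,
		      unsigned int family, unsigned int model)
{
	for (; tbl->family != 0; tbl++) {
		if (tbl->family == family && tbl->model == model)
			return true;
	}

	return false;
}

/* Convenience wrapper: is this family/model in the erratum table? */
bool has_tdx_pw_mce_sketch(unsigned int family, unsigned int model)
{
	return match_cpu_sketch(tdx_pw_mce_ids_sketch, family, model);
}
```

With such a table, covering another affected model would only take one more
entry ahead of the sentinel, and the detection function itself stays
unchanged.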