Re: [RFC PATCH v2 2/3] x86: cpu/bugs: add support for AMD ERAPS feature

Dave Hansen <dave.hansen@xxxxxxxxx> · Mon, 11 Nov 2024 10:57:08 -0800

On 11/11/24 08:39, Amit Shah wrote:
> Remove explicit RET stuffing / filling on VMEXITs and context
> switches on AMD CPUs with the ERAPS feature (Zen5+).

I'm not a super big fan of how this changelog is constructed.  I
personally really like the "background, problem, solution" form.  This
starts with the solution before even telling us what it is.

> With the Enhanced Return Address Prediction Security feature,  any of
> the following conditions triggers a flush of the RSB (aka RAP in AMD
> manuals) in hardware:
> * context switch (e.g., move to CR3)

IMNHO, context switching is a broader topic that process-to-process
switches.  A user=>kernel thread switch is definitely a context switch,
but doesn't involve a CR3 write.  A user=>user switch in the same mm is
also highly arguable to be a context switch and doesn't involve a CR3
write.  There are userspace threading libraries that do "context switches".

This is further muddled by the very comments you are editing in this
patch like: "unconditionally fill RSB on context switches".

Please be very, very precise in the language that's chosen here.

> * TLB flush
> * some writes to CR4

... and only one of those is relevant for this patch.  Aren't the others
just noise?

> The feature also explicitly tags host and guest addresses in the RSB -
> eliminating the need for explicit flushing of the RSB on VMEXIT.
> 
> [RFC note: We'll wait for the APM to be updated with the real wording,
> but assuming it's going to say the ERAPS feature works the way described
> above, let's continue the discussion re: when the kernel currently calls
> FILL_RETURN_BUFFER, and what dropping it entirely means.
> 
> Dave Hansen pointed out __switch_to_asm fills the RSB each time it's
> called, so let's address the cases there:
> 
> 1. user->kernel switch: Switching from userspace to kernelspace, and
>    then using user-stuffed RSB entries in the kernel is not possible
>    thanks to SMEP.  We can safely drop the FILL_RETURN_BUFFER call for
>    this case.  In fact, this is how the original code was when dwmw2
>    added it originally in c995efd5a.  So while this case currently
>    triggers an RSB flush (and will not after this ERAPS patch), the
>    current flush isn't necessary for AMD systems with SMEP anyway.

This description makes me uneasy.  It basically boils down to, "SMEP
guarantees that the kernel can't consume user-placed RSB entries."

It makes me uneasy because I recall that not holding true on some Intel
CPUs. I believe some CPUs have a partial-width RSB.  Kernel consumption
of a user-placed entry would induce the kernel to speculate to a
*kernel* address so SMEP is rather ineffective.

So, instead of just saying, "SMEP magically fixes everything, trust us",
could we please step through a few of the properties of the hardware and
software that make SMEP effective?

First, I think that you're saying that AMD hardware has a full-width,
precise RSB.  That ensures that speculation always happens back
precisely to the original address, not some weird alias.  Second, ERAPS
guarantees that the the address still has the same stuff mapped there.
Any change to the address space involves either a CR3 write or a TLB
flush, both of which would have flushed any user-placed content in the RSB.

	Aside: Even the kernel-only text poking mm or EFI mm would "just
	work" in this scenario since they have their own mm_structs,
	page tables and root CR3 values.

> 2. user->user or kernel->kernel: If a user->user switch does not result
>    in a CR3 change, it's a different thread in the same process context.
>    That's the same case for kernel->kernel switch.  In this case, the
>    RSB entries are still valid in that context, just not the correct
>    ones in the new thread's context.  It's difficult to imagine this
>    being a security risk.  The current code clearing it, and this patch
>    not doing so for AMD-with-ERAPS, isn't a concern as far as I see.
> ]

The current rules are dirt simple: if the kernel stack gets switched,
the RSB is flushed.  It's kinda hard to have a mismatched stack if it's
never switched in the first place.  That makes the current rules dirt
simple.  The problem (stack switching) is dirt simple to correlate 1:1
with the fix (RSB stuffing).

This patch proposes separating the problem (stack switching) and the
mitigations (implicit new microcode actions).  That's a hell of a lot
more complicated and hardware to audit than the existing rules.  Agreed?

So, how are the rules relaxed?

First, it opens up the case where user threads consume RSB entries from
other threads in the same process. Threads are usually at the same
privilege level.  Instead of using a speculative attack, an attacker
could just read the data directly.  The only place I can imagine this
even remotely being a problem would be if threads were relying on
protection keys to keep secrets from each other.

The kernel consuming RSB entries from another kernel thread seems less
clear.  I disagree with the idea that a valid entry in a given context
is a _safe_ entry though.  Spectre v2 in _general_ involves nasty
speculation to otherwise perfectly safe code. A problematic scenario
would be where kswapd wakes up after waiting for I/O and starts
speculating back along the return path of the userspace thread that
kswapd interrupted. Userspace has some level of control over both stacks
and it doesn't seem super far fetched to think that there could be some
disclosure gadgets in there.

In short: the user-consumes-user-RSB-entry attacks seem fundamentally
improbable because user threads are unlikely to have secrets from each
other.

The kernel-consumes-kernel-RSB-entry attacks seem hard rather than
fundamentally improbable.  I can't even count how many times our "oh
that attack would be too hard to pull off" optimism has gone wrong.

What does that all _mean_?  ERAPS sucks?  Nah.  Maybe we just make sure
that the existing spectre_v2=whatever controls can be used to stop
relying on it if asked.

> diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
> index 96b410b1d4e8..f5ee7fc71db5 100644
> --- a/arch/x86/include/asm/nospec-branch.h
> +++ b/arch/x86/include/asm/nospec-branch.h
> @@ -117,6 +117,18 @@
>   * We define a CPP macro such that it can be used from both .S files and
>   * inline assembly. It's possible to do a .macro and then include that
>   * from C via asm(".include <asm/nospec-branch.h>") but let's not go there.
> + *
> + * AMD CPUs with the ERAPS feature may have a larger default RSB.  These CPUs
> + * use the default number of entries on a host, and can optionally (based on
> + * hypervisor setup) use 32 (old) or the new default in a guest.  The number
> + * of default entries is reflected in CPUID 8000_0021:EBX[23:16].
> + *
> + * With the ERAPS feature, RSB filling is not necessary anymore: the RSB is
> + * auto-cleared by hardware on context switches, TLB flushes, or some CR4
> + * writes.  Adapting the value of RSB_CLEAR_LOOPS below for ERAPS would change
> + * it to a runtime variable instead of the current compile-time constant, so
> + * leave it as-is, as this works for both older CPUs, as well as newer ones
> + * with ERAPS.
>   */

This comment feels out of place and noisy to me.  Why the heck would we
need to document the RSB size enumeration here and not use it?

Then this goes on to explain why this patch *didn't* bother to change
RSB_CLEAR_LOOPS.  That seems much more like changelog material than a
code comment to me.

To me, RSB_CLEAR_LOOPS, is for, well, clearing the RSB.  The existing
comments say that some CPUs don't need to clear the RSB.  I don't think
we need further explanation of why one specific CPU doesn't need to
clear the RSB.  The new CPU isn't special.

Something slightly more useful would be to actually read CPUID
8000_0021:EBX[23:16] and compare it to RSB_CLEAR_LOOPS:

void amd_check_rsb_clearing(void)
{
	/*
	 * Systems with both these features off
	 * do not perform RSB clearing:
	 */
	if (!boot_cpu_has(X86_FEATURE_RSB_CTXSW) &&
	    !boot_cpu_has(X86_FEATURE_RSB_VMEXIT))
		return;

	// with actual helpers and mask generation, not this mess:
	rsb_size = (cpuid_ebx(0x80000021) >> 16) & 0xff;

	WARN_ON_ONCE(rsb_size > RSB_CLEAR_LOOPS);
}

That's almost shorter than the proposed comment.

>  #define RETPOLINE_THUNK_SIZE	32
> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> index 0aa629b5537d..02446815b0de 100644
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -1818,9 +1818,12 @@ static void __init spectre_v2_select_mitigation(void)
>  	pr_info("%s\n", spectre_v2_strings[mode]);
>  
>  	/*
> -	 * If Spectre v2 protection has been enabled, fill the RSB during a
> -	 * context switch.  In general there are two types of RSB attacks
> -	 * across context switches, for which the CALLs/RETs may be unbalanced.
> +	 * If Spectre v2 protection has been enabled, the RSB needs to be
> +	 * cleared during a context switch.  Either do it in software by
> +	 * filling the RSB, or in hardware via ERAPS.

I'd move this comment about using ERAPS down to the ERAPS check itself.

> +	 * In general there are two types of RSB attacks across context
> +	 * switches, for which the CALLs/RETs may be unbalanced.
>  	 *
>  	 * 1) RSB underflow
>  	 *
> @@ -1848,12 +1851,21 @@ static void __init spectre_v2_select_mitigation(void)
>  	 *    RSB clearing.
>  	 *
>  	 * So to mitigate all cases, unconditionally fill RSB on context
> -	 * switches.
> +	 * switches when ERAPS is not present.
>  	 */
> -	setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
> -	pr_info("Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch\n");
> +	if (!boot_cpu_has(X86_FEATURE_ERAPS)) {
> +		setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
> +		pr_info("Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch\n");
>  
> -	spectre_v2_determine_rsb_fill_type_at_vmexit(mode);
> +		/*
> +		 * For guest -> host (or vice versa) RSB poisoning scenarios,
> +		 * determine the mitigation mode here.  With ERAPS, RSB
> +		 * entries are tagged as host or guest - ensuring that neither
> +		 * the host nor the guest have to clear or fill RSB entries to
> +		 * avoid poisoning: skip RSB filling at VMEXIT in that case.
> +		 */
> +		spectre_v2_determine_rsb_fill_type_at_vmexit(mode);
> +	}

This seems to be following a pattern of putting excessive amounts of
very AMD-specific commenting right in the middle of common code.
There's a *HUGE* comment in
spectre_v2_determine_rsb_fill_type_at_vmexit().  Wouldn't this be more
at home there?

>  	/*
>  	 * Retpoline protects the kernel, but doesn't protect firmware.  IBRS
> @@ -2866,7 +2878,7 @@ static ssize_t spectre_v2_show_state(char *buf)
>  	    spectre_v2_enabled == SPECTRE_V2_EIBRS_LFENCE)
>  		return sysfs_emit(buf, "Vulnerable: eIBRS+LFENCE with unprivileged eBPF and SMT\n");
>  
> -	return sysfs_emit(buf, "%s%s%s%s%s%s%s%s\n",
> +	return sysfs_emit(buf, "%s%s%s%s%s%s%s%s%s\n",
>  			  spectre_v2_strings[spectre_v2_enabled],
>  			  ibpb_state(),
>  			  boot_cpu_has(X86_FEATURE_USE_IBRS_FW) ? "; IBRS_FW" : "",
> @@ -2874,6 +2886,7 @@ static ssize_t spectre_v2_show_state(char *buf)
>  			  boot_cpu_has(X86_FEATURE_RSB_CTXSW) ? "; RSB filling" : "",
>  			  pbrsb_eibrs_state(),
>  			  spectre_bhi_state(),
> +			  boot_cpu_has(X86_FEATURE_ERAPS) ? "; ERAPS hardware RSB flush" : "",
>  			  /* this should always be at the end */
>  			  spectre_v2_module_string());
>  }