Re: [PATCH AUTOSEL 5.15 11/12] x86/barrier: Do not serialize MSR accesses on AMD

Borislav Petkov <bp@xxxxxxxxx> · Thu, 28 Nov 2024 12:59:24 +0100

Hey folks,

On Mon, Jan 15, 2024 at 06:26:56PM -0500, Sasha Levin wrote:
> From: "Borislav Petkov (AMD)" <bp@xxxxxxxxx>
> 
> [ Upstream commit 04c3024560d3a14acd18d0a51a1d0a89d29b7eb5 ]
> 
> AMD does not have the requirement for a synchronization barrier when
> acccessing a certain group of MSRs. Do not incur that unnecessary
> penalty there.

Erwan just mentioned that this one is not in 6.1 and in 5.15. And I have mails
about it getting picked up by AUTOSEL.

Did the AI reconsider in the meantime?

:-P

Thx.

Two autosel messages are:

https://lore.kernel.org/r/20240115232718.209642-11-sashal@xxxxxxxxxx
https://lore.kernel.org/r/20240115232611.209265-13-sashal@xxxxxxxxxx

Leaving in the rest for reference.

> There will be a CPUID bit which explicitly states that a MFENCE is not
> needed. Once that bit is added to the APM, this will be extended with
> it.
> 
> While at it, move to processor.h to avoid include hell. Untangling that
> file properly is a matter for another day.
> 
> Some notes on the performance aspect of why this is relevant, courtesy
> of Kishon VijayAbraham <Kishon.VijayAbraham@xxxxxxx>:
> 
> On a AMD Zen4 system with 96 cores, a modified ipi-bench[1] on a VM
> shows x2AVIC IPI rate is 3% to 4% lower than AVIC IPI rate. The
> ipi-bench is modified so that the IPIs are sent between two vCPUs in the
> same CCX. This also requires to pin the vCPU to a physical core to
> prevent any latencies. This simulates the use case of pinning vCPUs to
> the thread of a single CCX to avoid interrupt IPI latency.
> 
> In order to avoid run-to-run variance (for both x2AVIC and AVIC), the
> below configurations are done:
> 
>   1) Disable Power States in BIOS (to prevent the system from going to
>      lower power state)
> 
>   2) Run the system at fixed frequency 2500MHz (to prevent the system
>      from increasing the frequency when the load is more)
> 
> With the above configuration:
> 
> *) Performance measured using ipi-bench for AVIC:
>   Average Latency:  1124.98ns [Time to send IPI from one vCPU to another vCPU]
> 
>   Cumulative throughput: 42.6759M/s [Total number of IPIs sent in a second from
>   				     48 vCPUs simultaneously]
> 
> *) Performance measured using ipi-bench for x2AVIC:
>   Average Latency:  1172.42ns [Time to send IPI from one vCPU to another vCPU]
> 
>   Cumulative throughput: 40.9432M/s [Total number of IPIs sent in a second from
>   				     48 vCPUs simultaneously]
> 
> From above, x2AVIC latency is ~4% more than AVIC. However, the expectation is
> x2AVIC performance to be better or equivalent to AVIC. Upon analyzing
> the perf captures, it is observed significant time is spent in
> weak_wrmsr_fence() invoked by x2apic_send_IPI().
> 
> With the fix to skip weak_wrmsr_fence()
> 
> *) Performance measured using ipi-bench for x2AVIC:
>   Average Latency:  1117.44ns [Time to send IPI from one vCPU to another vCPU]
> 
>   Cumulative throughput: 42.9608M/s [Total number of IPIs sent in a second from
>   				     48 vCPUs simultaneously]
> 
> Comparing the performance of x2AVIC with and without the fix, it can be seen
> the performance improves by ~4%.
> 
> Performance captured using an unmodified ipi-bench using the 'mesh-ipi' option
> with and without weak_wrmsr_fence() on a Zen4 system also showed significant
> performance improvement without weak_wrmsr_fence(). The 'mesh-ipi' option ignores
> CCX or CCD and just picks random vCPU.
> 
>   Average throughput (10 iterations) with weak_wrmsr_fence(),
>         Cumulative throughput: 4933374 IPI/s
> 
>   Average throughput (10 iterations) without weak_wrmsr_fence(),
>         Cumulative throughput: 6355156 IPI/s
> 
> [1] https://github.com/bytedance/kvm-utils/tree/master/microbenchmark/ipi-bench
> 
> Signed-off-by: Borislav Petkov (AMD) <bp@xxxxxxxxx>
> Link: https://lore.kernel.org/r/20230622095212.20940-1-bp@xxxxxxxxx
> Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>
> ---
>  arch/x86/include/asm/barrier.h     | 18 ------------------
>  arch/x86/include/asm/cpufeatures.h |  2 +-
>  arch/x86/include/asm/processor.h   | 18 ++++++++++++++++++
>  arch/x86/kernel/cpu/amd.c          |  3 +++
>  arch/x86/kernel/cpu/common.c       |  7 +++++++
>  arch/x86/kernel/cpu/hygon.c        |  3 +++
>  6 files changed, 32 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
> index 3ba772a69cc8..dab2db15a8c4 100644
> --- a/arch/x86/include/asm/barrier.h
> +++ b/arch/x86/include/asm/barrier.h
> @@ -81,22 +81,4 @@ do {									\
>  
>  #include <asm-generic/barrier.h>
>  
> -/*
> - * Make previous memory operations globally visible before
> - * a WRMSR.
> - *
> - * MFENCE makes writes visible, but only affects load/store
> - * instructions.  WRMSR is unfortunately not a load/store
> - * instruction and is unaffected by MFENCE.  The LFENCE ensures
> - * that the WRMSR is not reordered.
> - *
> - * Most WRMSRs are full serializing instructions themselves and
> - * do not require this barrier.  This is only required for the
> - * IA32_TSC_DEADLINE and X2APIC MSRs.
> - */
> -static inline void weak_wrmsr_fence(void)
> -{
> -	asm volatile("mfence; lfence" : : : "memory");
> -}
> -
>  #endif /* _ASM_X86_BARRIER_H */
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index d6089072ee41..dd6f714c215e 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -305,10 +305,10 @@
>  
>  
>  #define X86_FEATURE_MSR_TSX_CTRL	(11*32+20) /* "" MSR IA32_TSX_CTRL (Intel) implemented */
> -
>  #define X86_FEATURE_SRSO		(11*32+24) /* "" AMD BTB untrain RETs */
>  #define X86_FEATURE_SRSO_ALIAS		(11*32+25) /* "" AMD BTB untrain RETs through aliasing */
>  #define X86_FEATURE_IBPB_ON_VMEXIT	(11*32+26) /* "" Issue an IBPB only on VMEXIT */
> +#define X86_FEATURE_APIC_MSRS_FENCE	(11*32+27) /* "" IA32_TSC_DEADLINE and X2APIC MSRs need fencing */
>  
>  /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
>  #define X86_FEATURE_AVX_VNNI		(12*32+ 4) /* AVX VNNI instructions */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index bbbf27cfe701..52329752dd4f 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -861,4 +861,22 @@ enum mds_mitigations {
>  
>  extern bool gds_ucode_mitigated(void);
>  
> +/*
> + * Make previous memory operations globally visible before
> + * a WRMSR.
> + *
> + * MFENCE makes writes visible, but only affects load/store
> + * instructions.  WRMSR is unfortunately not a load/store
> + * instruction and is unaffected by MFENCE.  The LFENCE ensures
> + * that the WRMSR is not reordered.
> + *
> + * Most WRMSRs are full serializing instructions themselves and
> + * do not require this barrier.  This is only required for the
> + * IA32_TSC_DEADLINE and X2APIC MSRs.
> + */
> +static inline void weak_wrmsr_fence(void)
> +{
> +	alternative("mfence; lfence", "", ALT_NOT(X86_FEATURE_APIC_MSRS_FENCE));
> +}
> +
>  #endif /* _ASM_X86_PROCESSOR_H */
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index dba7fe7ecea9..06186b7d0ed0 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -1158,6 +1158,9 @@ static void init_amd(struct cpuinfo_x86 *c)
>  	if (!cpu_has(c, X86_FEATURE_HYPERVISOR) &&
>  	     cpu_has_amd_erratum(c, amd_erratum_1485))
>  		msr_set_bit(MSR_ZEN4_BP_CFG, MSR_ZEN4_BP_CFG_SHARED_BTB_FIX_BIT);
> +
> +	/* AMD CPUs don't need fencing after x2APIC/TSC_DEADLINE MSR writes. */
> +	clear_cpu_cap(c, X86_FEATURE_APIC_MSRS_FENCE);
>  }
>  
>  #ifdef CONFIG_X86_32
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 01c4f8f45b83..1e870f6dd51c 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -1687,6 +1687,13 @@ static void identify_cpu(struct cpuinfo_x86 *c)
>  	c->apicid = apic->phys_pkg_id(c->initial_apicid, 0);
>  #endif
>  
> +
> +	/*
> +	 * Set default APIC and TSC_DEADLINE MSR fencing flag. AMD and
> +	 * Hygon will clear it in ->c_init() below.
> +	 */
> +	set_cpu_cap(c, X86_FEATURE_APIC_MSRS_FENCE);
> +
>  	/*
>  	 * Vendor-specific initialization.  In this section we
>  	 * canonicalize the feature flags, meaning if there are
> diff --git a/arch/x86/kernel/cpu/hygon.c b/arch/x86/kernel/cpu/hygon.c
> index 9e8380bd4fb9..8a80d5343f3a 100644
> --- a/arch/x86/kernel/cpu/hygon.c
> +++ b/arch/x86/kernel/cpu/hygon.c
> @@ -347,6 +347,9 @@ static void init_hygon(struct cpuinfo_x86 *c)
>  		set_cpu_bug(c, X86_BUG_SYSRET_SS_ATTRS);
>  
>  	check_null_seg_clears_base(c);
> +
> +	/* Hygon CPUs don't need fencing after x2APIC/TSC_DEADLINE MSR writes. */
> +	clear_cpu_cap(c, X86_FEATURE_APIC_MSRS_FENCE);
>  }
>  
>  static void cpu_detect_tlb_hygon(struct cpuinfo_x86 *c)
> -- 
> 2.43.0
> 

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette