RE: [patch v2 21/38] x86/cpu: Provide cpu_init/parse_topology()

"Michael Kelley (LINUX)" <mikelley@xxxxxxxxxxxxx> · Mon, 31 Jul 2023 04:05:39 +0000

From: Thomas Gleixner <tglx@xxxxxxxxxxxxx> Sent: Friday, July 28, 2023 5:13 AM
> 
> Topology evaluation is a complete disaster and impenetrable mess. It's
> scattered all over the place with some vendor implementatins doing early
> evaluation and some not. The most horrific part is the permanent
> overwriting of smt_max_siblings and __max_die_per_package, instead of
> establishing them once on the boot CPU and validating the result on the
> APs.
> 
> The goals are:
> 
>   - One topology evaluation entry point
> 
>   - Proper sharing of pointlessly duplicated code
> 
>   - Proper structuring of the evaluation logic and preferences.
> 
>   - Evaluating important system wide information only once on the boot CPU
> 
>   - Making the 0xb/0x1f leaf parsing less convoluted and actually fixing
>     the short comings of leaf 0x1f evaluation.
> 
> Start to consolidate the topology evaluation code by providing the entry
> points for the early boot CPU evaluation and for the final parsing on the
> boot CPU and the APs.
> 
> Move the trivial pieces into that new code:
> 
>    - The initialization of cpuinfo_x86::topo
> 
>    - The evaluation of CPUID leaf 1, which presets topo::initial_apicid
> 
>    - topo_apicid is set to topo::initial_apicid when invoked from early
>      boot. When invoked for the final evaluation on the boot CPU it reads
>      the actual APIC ID,  which makes apic_get_initial_apicid() obsolete
>      once everything is converted over.

> 
> Provide a temporary helper function topo_converted() which shields off the
> not yet converted CPU vendors from invoking code which would break them.
> This shielding covers all vendor CPUs which support SMP, but not the
> historical pure UP ones as they only need the topology info init and
> eventually the initial APIC initialization.
> 
> Provide two new members in cpuinfo_x86::topo to store the maximum number of
> SMT siblings and the number of dies per package and add them to the debugfs
> readout. These two members will be used to populate this information on the
> boot CPU and to validate the APs against it.
> 
> Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> ---
>  arch/x86/include/asm/topology.h       |   19 +++
>  arch/x86/kernel/cpu/Makefile          |    3
>  arch/x86/kernel/cpu/common.c          |   23 +---
>  arch/x86/kernel/cpu/cpu.h             |    6 +
>  arch/x86/kernel/cpu/debugfs.c         |   37 ++++++
>  arch/x86/kernel/cpu/topology.h        |   32 +++++
>  arch/x86/kernel/cpu/topology_common.c |  187
> ++++++++++++++++++++++++++++++++++
>  7 files changed, 290 insertions(+), 17 deletions(-)
> 

[snip]

> +
> +static void parse_topology(struct topo_scan *tscan, bool early)
> +{
> +	const struct cpuinfo_topology topo_defaults = {
> +		.cu_id			= 0xff,
> +		.llc_id			= BAD_APICID,
> +		.l2c_id			= BAD_APICID,
> +	};
> +	struct cpuinfo_x86 *c = tscan->c;
> +	struct {
> +		u32	unused0		: 16,
> +			nproc		:  8,
> +			apicid		:  8;
> +	} ebx;
> +
> +	c->topo = topo_defaults;
> +
> +	if (fake_topology(tscan))
> +	    return;
> +
> +	/* Preset Initial APIC ID from CPUID leaf 1 */
> +	cpuid_leaf_reg(1, CPUID_EBX, &ebx);
> +	c->topo.initial_apicid = ebx.apicid;
> +
> +	/*
> +	 * The initial invocation from early_identify_cpu() happens before
> +	 * the APIC is mapped or X2APIC enabled. For establishing the
> +	 * topology, that's not required. Use the initial APIC ID.
> +	 */
> +	if (early)
> +		c->topo.apicid = c->topo.initial_apicid;
> +	else
> +		c->topo.apicid = read_apic_id();

Using the value from the local APIC ID reg turns out to cause a problem in
some Hyper-V VM configurations.  If a VM has multiple L3 caches (probably
due to multiple NUMA nodes) and the # of CPUs in the span of the L3 cache
is not a power of 2, the APIC IDs for the CPUs in the span of the 1st L3 cache
are sequential starting with 0.  But then there is a gap before starting the
APIC IDs for the CPUs in the span of the 2nd L3 cache.  The gap is
repeated if there are additional L3 caches.

The CPUID instruction executed on a guest vCPU correctly reports the APIC
IDs.  However, the ACPI MADT assigns the APIC IDs sequentially with no
gaps, and the guest firmware sets the APIC_ID register for each local APIC
to match the MADT.  When parse_topology() sets the apicid field based on
reading the local APIC ID register, the value it sets is different from the
initial_apicid value for CPUs in the span of the 2nd and subsequent L3
caches, because there's no gap in the APIC IDs read from the local APIC.
Linux boots and runs, but the topology is set up with the wrong span for
the L3 cache and for the associated scheduling domains. 

The old code derives the apicid from the initial_apicid via the
phys_pkg_id() callback, so these bad Hyper-V VM configs skate by.  The
wrong value in the local APIC ID register and MADT does not affect
anything, except that the check in validate_apic_and_package_id() fails
during boot, and a set of "Firmware bug:" messages is correctly output.

Three thoughts:

1)  Are Hyper-V VMs the only place where the local APIC ID register might
have a bogus value?  Probably so, but you never know what might crawl out.

2) The natural response is "Well, fix Hyper-V!"  I first had this conversation
with the Hyper-V team about 5 years ago.  Some cases of the problem were
fixed, but some cases remain unfixed.  It's a long story.

3)  Since Hyper-V code in Linux already has an override for the apic->read()
function, it's possible to do a hack in that override so that apicid gets set to
the same value as initial_apicid, which matches the old code.  Here's the diff:

diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
index 72d9931da3a2..2e7b18557186 100644
--- a/arch/x86/hyperv/hv_apic.c
+++ b/arch/x86/hyperv/hv_apic.c
@@ -58,6 +58,8 @@ static u32 hv_apic_read(u32 reg)
        u32 reg_val, hi;

        switch (reg) {
+       case APIC_ID:
+               return __this_cpu_read(cpu_info.topo.initial_apicid) << 24;
        case APIC_EOI:
                rdmsr(HV_X64_MSR_EOI, reg_val, hi);
                (void)hi;
@@ -311,6 +313,7 @@ void __init hv_apic_init(void)
                 * both xapic and x2apic because the field layout is the same.
                 */
                apic_update_callback(eoi, hv_apic_eoi_write);
+               apic->apic_id_registered = NULL;
                if (!x2apic_enabled()) {
                        apic_update_callback(read, hv_apic_read);
                        apic_update_callback(write, hv_apic_write);

Setting apic->apic_id_registered to NULL is necessary because it does
read_apic_id() and checks that the value matches an APIC ID that was
registered when the MADT was parsed.  This test fails for some vCPUs
in the VM because the APIC IDs from the MADT are also sequential
with no gaps as mentioned above. I don't see any big hazard in
bypassing the check.

The hv_apic_read() override is used only in VMs with an xapic.
I still need to check a few things, but I believe Hyper-V gets
MADT and local APIC ID reg numbering correct when an x2apic
is used, so I don't think any hacks are needed for that path.

Does anyone have suggestions on a different way to handle
this that's better than the above diff?  Other thoughts?

Michael