RE: [patch v2 21/38] x86/cpu: Provide cpu_init/parse_topology()

Thomas Gleixner <tglx@xxxxxxxxxxxxx> · Mon, 31 Jul 2023 14:34:39 +0200

On Mon, Jul 31 2023 at 04:05, Michael Kelley wrote:
>> +	/*
>> +	 * The initial invocation from early_identify_cpu() happens before
>> +	 * the APIC is mapped or X2APIC enabled. For establishing the
>> +	 * topology, that's not required. Use the initial APIC ID.
>> +	 */
>> +	if (early)
>> +		c->topo.apicid = c->topo.initial_apicid;
>> +	else
>> +		c->topo.apicid = read_apic_id();
>
> Using the value from the local APIC ID reg turns out to cause a problem in
> some Hyper-V VM configurations.  If a VM has multiple L3 caches (probably
> due to multiple NUMA nodes) and the # of CPUs in the span of the L3 cache
> is not a power of 2, the APIC IDs for the CPUs in the span of the 1st L3 cache
> are sequential starting with 0.  But then there is a gap before starting the
> APIC IDs for the CPUs in the span of the 2nd L3 cache.  The gap is
> repeated if there are additional L3 caches.
>
> The CPUID instruction executed on a guest vCPU correctly reports the APIC
> IDs.  However, the ACPI MADT assigns the APIC IDs sequentially with no
> gaps, and the guest firmware sets the APIC_ID register for each local APIC
> to match the MADT.  When parse_topology() sets the apicid field based on
> reading the local APIC ID register, the value it sets is different from the
> initial_apicid value for CPUs in the span of the 2nd and subsequent L3
> caches, because there's no gap in the APIC IDs read from the local APIC.
> Linux boots and runs, but the topology is set up with the wrong span for
> the L3 cache and for the associated scheduling domains.

TBH. That's an insanity. MADT and the actual APIC ID determine the
topology. So the gaps should be reflected in MADT and the actual APIC
IDs should be set correctly if the intent is to provide topology
information.

Just for the record. This hack works only on Intel today, because AMD
init sets topo.apicid = read_apic_id() unconditionally. So this is
inconsistent already, no?

> The old code derives the apicid from the initial_apicid via the
> phys_pkg_id() callback, so these bad Hyper-V VM configs skate by.  The
> wrong value in the local APIC ID register and MADT does not affect
> anything, except that the check in validate_apic_and_package_id() fails
> during boot, and a set of "Firmware bug:" messages is correctly
> output.

So instead of fixing the firmware bugs, hyper-v just moves on and
pretends that everything works fine, right?

> Three thoughts:
>
> 1)  Are Hyper-V VMs the only place where the local APIC ID register might
> have a bogus value?  Probably so, but you never know what might crawl
> out.

Define bogus. MADT is the primary source of information because that's
how we know how many CPUs (APICs) are there and what their APIC ID is
which we can use to wake them up. So there is a reasonable expectation
that this information is consistent with the rest of the system.

The Intel SDM clearly says in Vol 3A section 9.4.5 Identifying Logical
Processors in an MP System:

  "After the BIOS has completed the MP initialization protocol, each
   logical processor can be uniquely identified by its local APIC
   ID. Software can access these APIC IDs in either of the following
   ways:"

These ways include read from APIC, read MADT, read CPUID and implies
that this must be consistent. For X2APIC it's actually written out:

  "If the local APIC unit supports x2APIC and is operating in x2APIC
   mode, 32-bit APIC ID can be read by executing a RDMSR instruction to
   read the processor’s x2APIC ID register. This method is equivalent to
   executing CPUID leaf 0BH described below."

AMD has not been following that in the early 64bit systems as they moved
the APIC ID space to start at 32 for the first CPU in the first socket
for whatever reasons. But since then the kernel reads back the APIC ID
on AMD systems into topo.apicid. But that was long ago and can easily be
dealt with because at least the real APIC ID and the MADT/MPTABLE
entries are consistent.

Hypervisors have their own CPUID space to override functionality with
their own magic stuff, but imposing their nutbolt ideas on the
architectural part of the system is not only wrong, it's disrespectful
against the OS developers who try to keep their system sane.

> 2) The natural response is "Well, fix Hyper-V!"  I first had this conversation
> with the Hyper-V team about 5 years ago.  Some cases of the problem were
> fixed, but some cases remain unfixed.  It's a long story.
>
> 3)  Since Hyper-V code in Linux already has an override for the apic->read()
> function, it's possible to do a hack in that override so that apicid gets set to
> the same value as initial_apicid, which matches the old code.  Here's the diff:

This collides massively with the other work I'm doing, which uses the
MADT provided information to actually evaluate various topology related
things upfront and later during bringup. Thats badly needed because lots
of todays infrastructure is based on heuristics and guesswork.

But it seems I wasted a month on reworking all of this just to be
stopped cold in the tracks by completely undocumented and unnecessary
hyper-v abuse.

So if Hyper-V insists on abusing the initial APIC ID as read from CPUID
for topology information related to L3, then hyper-v should override the
cache topology mechanism and not impose this insanity on the basic
topology evaluation infrastructure.

Yours seriously grumpy

      tglx