From: Thomas Gleixner <tglx@xxxxxxxxxxxxx> Sent: Monday, July 31, 2023 5:35 AM > > On Mon, Jul 31 2023 at 04:05, Michael Kelley wrote: > >> + /* > >> + * The initial invocation from early_identify_cpu() happens before > >> + * the APIC is mapped or X2APIC enabled. For establishing the > >> + * topology, that's not required. Use the initial APIC ID. > >> + */ > >> + if (early) > >> + c->topo.apicid = c->topo.initial_apicid; > >> + else > >> + c->topo.apicid = read_apic_id(); > > > > Using the value from the local APIC ID reg turns out to cause a problem in > > some Hyper-V VM configurations. If a VM has multiple L3 caches (probably > > due to multiple NUMA nodes) and the # of CPUs in the span of the L3 cache > > is not a power of 2, the APIC IDs for the CPUs in the span of the 1st L3 cache > > are sequential starting with 0. But then there is a gap before starting the > > APIC IDs for the CPUs in the span of the 2nd L3 cache. The gap is > > repeated if there are additional L3 caches. > > > > The CPUID instruction executed on a guest vCPU correctly reports the APIC > > IDs. However, the ACPI MADT assigns the APIC IDs sequentially with no > > gaps, and the guest firmware sets the APIC_ID register for each local APIC > > to match the MADT. When parse_topology() sets the apicid field based on > > reading the local APIC ID register, the value it sets is different from the > > initial_apicid value for CPUs in the span of the 2nd and subsequent L3 > > caches, because there's no gap in the APIC IDs read from the local APIC. > > Linux boots and runs, but the topology is set up with the wrong span for > > the L3 cache and for the associated scheduling domains. > > TBH. That's an insanity. MADT and the actual APIC ID determine the > topology. So the gaps should be reflected in MADT and the actual APIC > IDs should be set correctly if the intent is to provide topology > information. > > Just for the record. This hack works only on Intel today, because AMD > init sets topo.apicid = read_apic_id() unconditionally. So this is > inconsistent already, no? > Correct. But given that the L3 cache span in the AMD Zen1 and Zen2 processors is only 8 CPUs, there's much less reason to configure a VM that only uses some of the CPUs in an L3 cache span. Hyper-V does the APIC ID numbering correctly for Zen3 with its 16 CPUs in the L3 cache span. > > The old code derives the apicid from the initial_apicid via the > > phys_pkg_id() callback, so these bad Hyper-V VM configs skate by. The > > wrong value in the local APIC ID register and MADT does not affect > > anything, except that the check in validate_apic_and_package_id() fails > > during boot, and a set of "Firmware bug:" messages is correctly > > output. > > So instead of fixing the firmware bugs, hyper-v just moves on and > pretends that everything works fine, right? What can I say. :-( > > > Three thoughts: > > > > 1) Are Hyper-V VMs the only place where the local APIC ID register might > > have a bogus value? Probably so, but you never know what might crawl > > out. > > Define bogus. MADT is the primary source of information because that's > how we know how many CPUs (APICs) are there and what their APIC ID is > which we can use to wake them up. So there is a reasonable expectation > that this information is consistent with the rest of the system. Commit d49597fd3bc7 "x86/cpu: Deal with broken firmware (VMWare/Xen)" mentions VMware and XEN implementations that violate the spec. The commit is from late 2016. Have these bad systems aged out and no longer need accommodation? > > The Intel SDM clearly says in Vol 3A section 9.4.5 Identifying Logical > Processors in an MP System: > > "After the BIOS has completed the MP initialization protocol, each > logical processor can be uniquely identified by its local APIC > ID. Software can access these APIC IDs in either of the following > ways:" > > These ways include read from APIC, read MADT, read CPUID and implies > that this must be consistent. For X2APIC it's actually written out: > > "If the local APIC unit supports x2APIC and is operating in x2APIC > mode, 32-bit APIC ID can be read by executing a RDMSR instruction to > read the processor’s x2APIC ID register. This method is equivalent to > executing CPUID leaf 0BH described below." > > AMD has not been following that in the early 64bit systems as they moved > the APIC ID space to start at 32 for the first CPU in the first socket > for whatever reasons. But since then the kernel reads back the APIC ID > on AMD systems into topo.apicid. But that was long ago and can easily be > dealt with because at least the real APIC ID and the MADT/MPTABLE > entries are consistent. > > Hypervisors have their own CPUID space to override functionality with > their own magic stuff, but imposing their nutbolt ideas on the > architectural part of the system is not only wrong, it's disrespectful > against the OS developers who try to keep their system sane. > > > 2) The natural response is "Well, fix Hyper-V!" I first had this conversation > > with the Hyper-V team about 5 years ago. Some cases of the problem were > > fixed, but some cases remain unfixed. It's a long story. > > > > 3) Since Hyper-V code in Linux already has an override for the apic->read() > > function, it's possible to do a hack in that override so that apicid gets set to > > the same value as initial_apicid, which matches the old code. Here's the diff: > > This collides massively with the other work I'm doing, which uses the > MADT provided information to actually evaluate various topology related > things upfront and later during bringup. Thats badly needed because lots > of todays infrastructure is based on heuristics and guesswork. Fair enough. And I've re-raised the issue with the Hyper-V team. > > But it seems I wasted a month on reworking all of this just to be > stopped cold in the tracks by completely undocumented and unnecessary > hyper-v abuse. > > So if Hyper-V insists on abusing the initial APIC ID as read from CPUID > for topology information related to L3, then hyper-v should override the > cache topology mechanism and not impose this insanity on the basic > topology evaluation infrastructure. > > Yours seriously grumpy > > tglx