On Thu, Mar 21 2024 at 07:06, Guenter Roeck wrote: > On 3/21/24 04:14, Thomas Gleixner wrote: >> If so can you please provide a full dmesg and then apply the patch below >> and provide the resulting full dmesg too? > > You'll find everything at http://server.roeck-us.net/qemu/x86-nosmp/ Thanks for providing this. > The crash is gone after applying your patch. The difference is: > > + /* > + * If there was no APIC registered, then the map check below would > + * fail. With no APIC this is guaranteed to be an UP system and > + * therefore all topology levels have only one entry and their > + * logical ID is obviously 0. > + */ > + if (topo_info.boot_cpu_apic_id == BAD_APICID) { > + pr_info("#### topo_info.boot_cpu_apic_id == BAD_APICID\n"); > ^^^^ I added this > + return 0; > + } > + > > I see the "#### topo_info.boot_cpu_apic_id == BAD_APICID" message > twice in the log. See patched.log at the page pointed to above. I can see why this is emitted. That happens on the initial CPUID evaluation of the boot CPU very early during boot. [ 0.000000] Command line: console=ttyS0 [ 0.000000] CPU topo: APIC logical ID: 0 0 6 [ 0.000000] CPU topo: #### topo_info.boot_cpu_apic_id == BAD_APICID [ 0.000000] CPU topo: APIC logical ID: 0 0 4 [ 0.000000] CPU topo: #### topo_info.boot_cpu_apic_id == BAD_APICID The later full CPUID evaluation happens after the ACPI enumeration and way before the affected RAPL driver is initialized: [ 0.088029] CPU topo: APIC logical ID: 0 0 6 [ 0.088084] CPU topo: APIC logical ID: 0 0 4 This invocation has the boot APIC registered as your extra print does not show up. ... [ 0.585850] RAPL PMU: API unit is 2^-32 Joules, 0 fixed counters, 10737418240 ms ovfl timer So even without that guard (which we need anyway for the non APIC case) topology_logical_die_id() == cpu_data(cpu).topo.logical_die_id must have the correct value in that RAPL initialization and CPU hotplug callback code. But our absolutely convoluted startup logic prevents that because: identify_cpu_early() operates on boot_cpu_data smp_prepare_boot_cpu() copies boot_cpu_data to per CPU cpu data identify_boot_cpu() operates on boot_cpu_data identify_boot_cpu() is the one which gets the correct logical die info, but that never gets copied over to the per CPU data instance on which the RAPL code and everything else works on. I'll cook up a patch later. Thanks for providing the info! tglx