On Fri, 17 Dec 2021 00:13:16 +0000 David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote: > On Thu, 2021-12-16 at 16:52 -0600, Tom Lendacky wrote: > > On baremetal, I haven't seen an issue. This only seems to have a problem > > with Qemu/KVM. > > > > With 191f08997577 I could boot without issues with and without the > > no_parallel_bringup. Only after I applied e78fa57dd642 did the failure happen. > > > > With e78fa57dd642 I could boot 64 vCPUs pretty consistently, but when I > > jumped to 128 vCPUs it failed again. When I moved the series to > > df9726cb7178, then 64 vCPUs also failed pretty consistently. > > > > Strange thing is it is random. Sometimes (rarely) it works on the first > > boot and then sometimes it doesn't, at which point it will reset and > > reboot 3 or 4 times and then make it past the failure and fully boot. > > Hm, some of that is just artifacts of timing, I'm sure. But now I'm that's most likely the case (there is a race somewhere left). To trigger CPU bringup (hotplug) races, I used to run QEMU guest with heavy vCPU overcommit. It helps to induce unexpected delays at CPU bringup time. > staring at the way that early_setup_idt() can run in parallel on all > CPUs, rewriting bringup_idt_descr and loading it. > > To start with, let's try unlocking the trampoline_lock much later, > after cpu_init_exception_handling() has loaded the real IDT. > > I think we can probably make secondaries load the real IDT early and > never use bringup_idt_descr at all, can't we? But let's see if this > makes it go away, to start with... > > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c > index 0cd6373bc3f2..2307f7575ab4 100644 > --- a/arch/x86/kernel/cpu/common.c > +++ b/arch/x86/kernel/cpu/common.c > @@ -59,7 +59,7 @@ > #include <asm/cpu_device_id.h> > #include <asm/uv/uv.h> > #include <asm/sigframe.h> > - > +#include <asm/realmode.h> > #include "cpu.h" > > u32 elf_hwcap2 __read_mostly; > @@ -2060,6 +2060,7 @@ void cpu_init_secondary(void) > * on this CPU in cpu_init_exception_handling(). > */ > cpu_init_exception_handling(); > + clear_bit(0, (unsigned long *)trampoline_lock); > cpu_init(); > } > #endif > diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S > index 3e4c3c416bce..db01b56574cd 100644 > --- a/arch/x86/kernel/head_64.S > +++ b/arch/x86/kernel/head_64.S > @@ -273,14 +273,6 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL) > */ > movq initial_stack(%rip), %rsp > > - /* Drop the realmode protection. For the boot CPU the pointer is NULL! */ > - movq trampoline_lock(%rip), %rax > - testq %rax, %rax > - jz .Lsetup_idt > - lock > - btrl $0, (%rax) > - > -.Lsetup_idt: > /* Setup and Load IDT */ > pushq %rsi > call early_setup_idt