On Fri, 2024-12-13 at 01:14 +0100, Thomas Gleixner wrote:
> 
> With that applied the problem goes away, but after a lot of repetitions
> of the reproducer in a tight loop the whole machinery stops dead:
> 
> [ 29.913179] Disabling non-boot CPUs ...
> [ 29.930328] smpboot: CPU 1 is now offline
> [ 29.930593] crash hp: kexec_trylock() failed, kdump image may be inaccurate
> [ 29.940588] Enabling non-boot CPUs ...
> [ 29.940856] crash hp: kexec_trylock() failed, kdump image may be inaccurate
> [ 29.941242] smpboot: Booting Node 0 Processor 1 APIC 0x1
> [ 29.942654] CPU1 is up
> [ 29.945856] virtio_blk virtio1: 2/0/0 default/read/poll queues
> [ 29.948556] OOM killer enabled.
> [ 29.948750] Restarting tasks ... done.
> Success
> [ 29.960044] Freezing user space processes
> [ 29.961447] Freezing user space processes completed (elapsed 0.001 seconds)
> [ 29.961861] OOM killer disabled.
> [ 30.102485] ata2: found unknown device (class 0)
> [ 30.107387] Disabling non-boot CPUs ...
> 
> That happens without 'no_console_suspend' on the command line as
> well, but that's for tomorrow ...

I think I saw that lockup once last night too.

This morning I did not see it after hundreds of invocations on my
kexec-debug tree (based on tip/x86/boot which is 6.13-rc1). I switched
to master (231825b2e1 still) and saw it after a few attempts.

[ 34.250006] Freezing user space processes
[ 34.251930] Freezing user space processes completed (elapsed 0.001 seconds)
[ 34.252730] OOM killer disabled.
[ 34.253141] printk: Suspending console(s) (use no_console_suspend to debug)

(gdb) t a a bt

Thread 2 (Thread 1.2 (CPU#1 [halted ])):
#0  0xffffffff8235886f in pv_native_safe_halt () at arch/x86/kernel/paravirt.c:127
#1  0xffffffff8235b699 in arch_safe_halt () at ./arch/x86/include/asm/paravirt.h:175
#2  default_idle () at arch/x86/kernel/process.c:742
#3  0xffffffff8235bb0a in default_idle_call () at kernel/sched/idle.c:117
#4  0xffffffff81243195 in cpuidle_idle_call () at kernel/sched/idle.c:185
#5  do_idle () at kernel/sched/idle.c:325
#6  0xffffffff812434b9 in cpu_startup_entry (state=state@entry=CPUHP_AP_ONLINE_IDLE) at kernel/sched/idle.c:423
#7  0xffffffff8115b572 in start_secondary (unused=<optimized out>) at arch/x86/kernel/smpboot.c:314
#8  0xffffffff8110a38d in secondary_startup_64 () at arch/x86/kernel/head_64.S:414
#9  0x0000000000000000 in ?? ()

Thread 1 (Thread 1.1 (CPU#0 [halted ])):
#0  0xffffffff8235886f in pv_native_safe_halt () at arch/x86/kernel/paravirt.c:127
#1  0xffffffff8235b699 in arch_safe_halt () at ./arch/x86/include/asm/paravirt.h:175
#2  default_idle () at arch/x86/kernel/process.c:742
#3  0xffffffff8235bb0a in default_idle_call () at kernel/sched/idle.c:117
#4  0xffffffff81243195 in cpuidle_idle_call () at kernel/sched/idle.c:185
#5  do_idle () at kernel/sched/idle.c:325
#6  0xffffffff812434b9 in cpu_startup_entry (state=state@entry=CPUHP_ONLINE) at kernel/sched/idle.c:423
#7  0xffffffff8235c9c7 in rest_init () at init/main.c:747
#8  0xffffffff8419a694 in start_kernel () at init/main.c:1102
#9  0xffffffff841ac6a4 in x86_64_start_reservations (real_mode_data=real_mode_data@entry=0x147b0 <exception_stacks+34736> <error: Cannot access memory at address 0x147b0>) at arch/x86/kernel/head64.c:507
#10 0xffffffff841ac7fd in x86_64_start_kernel (real_mode_data=0x147b0 <exception_stacks+34736> <error: Cannot access memory at address 0x147b0>) at arch/x86/kernel/head64.c:488
#11 0xffffffff8110a38d in secondary_startup_64 () at arch/x86/kernel/head_64.S:414
#12 0x0000000000000000 in ?? ()
(gdb)

But I haven't ingested your fix yet, so maybe I can't be entirely
surprised if CPU0 scheduled away and ended up in the idle thread?

If I were cleverer I'd remember how to make gdb give me a backtrace for
the process which is actually in the kexec sys_reboot() system call,
instead of the boring idle thread.

(gdb) p sysrq_handle_showstate('t')

That didn't work. Maybe if I'd actually had no_console_suspend on this
boot. Will try again.