Re: Lockdep warnings on kexec (virtio_blk, hrtimers)

David Woodhouse <dwmw2@xxxxxxxxxxxxx> · Fri, 13 Dec 2024 09:31:11 +0000

On Fri, 2024-12-13 at 01:14 +0100, Thomas Gleixner wrote:
> 
> With that applied the problem goes away, but after a lot of repetitions
> of the reproducer in a tight loop the whole machinery stops dead:
> 
> [   29.913179] Disabling non-boot CPUs ...
> [   29.930328] smpboot: CPU 1 is now offline
> [   29.930593] crash hp: kexec_trylock() failed, kdump image may be inaccurate
> B[   29.940588] Enabling non-boot CPUs ...
> [   29.940856] crash hp: kexec_trylock() failed, kdump image may be inaccurate
> [   29.941242] smpboot: Booting Node 0 Processor 1 APIC 0x1
> [   29.942654] CPU1 is up
> [   29.945856] virtio_blk virtio1: 2/0/0 default/read/poll queues
> [   29.948556] OOM killer enabled.
> [   29.948750] Restarting tasks ... done.
> Success
> [   29.960044] Freezing user space processes
> [   29.961447] Freezing user space processes completed (elapsed 0.001 seconds)
> [   29.961861] OOM killer disabled.
> [   30.102485] ata2: found unknown device (class 0)
> [   30.107387] Disabling non-boot CPUs ...
> 
> That happens without 'no_console_suspend' on the command line as
> well, but that's for tomorrow ...

I think I saw that lockup once last night too. This morning I did not
see it after hundreds of invocations on my kexec-debug tree (based on
tip/x86/boot which is 6.13-rc1).

I switched to master (231825b2e1 still) and saw it after a few
attempts.

[   34.250006] Freezing user space processes
[   34.251930] Freezing user space processes completed (elapsed 0.001 seconds)
[   34.252730] OOM killer disabled.
[   34.253141] printk: Suspending console(s) (use no_console_suspend to debug)

(gdb) t a a bt

Thread 2 (Thread 1.2 (CPU#1 [halted ])):
#0  0xffffffff8235886f in pv_native_safe_halt () at arch/x86/kernel/paravirt.c:127
#1  0xffffffff8235b699 in arch_safe_halt () at ./arch/x86/include/asm/paravirt.h:175
#2  default_idle () at arch/x86/kernel/process.c:742
#3  0xffffffff8235bb0a in default_idle_call () at kernel/sched/idle.c:117
#4  0xffffffff81243195 in cpuidle_idle_call () at kernel/sched/idle.c:185
#5  do_idle () at kernel/sched/idle.c:325
#6  0xffffffff812434b9 in cpu_startup_entry (state=state@entry=CPUHP_AP_ONLINE_IDLE) at kernel/sched/idle.c:423
#7  0xffffffff8115b572 in start_secondary (unused=<optimized out>) at arch/x86/kernel/smpboot.c:314
#8  0xffffffff8110a38d in secondary_startup_64 () at arch/x86/kernel/head_64.S:414
#9  0x0000000000000000 in ?? ()

Thread 1 (Thread 1.1 (CPU#0 [halted ])):
#0  0xffffffff8235886f in pv_native_safe_halt () at arch/x86/kernel/paravirt.c:127
#1  0xffffffff8235b699 in arch_safe_halt () at ./arch/x86/include/asm/paravirt.h:175
#2  default_idle () at arch/x86/kernel/process.c:742
#3  0xffffffff8235bb0a in default_idle_call () at kernel/sched/idle.c:117
#4  0xffffffff81243195 in cpuidle_idle_call () at kernel/sched/idle.c:185
#5  do_idle () at kernel/sched/idle.c:325
#6  0xffffffff812434b9 in cpu_startup_entry (state=state@entry=CPUHP_ONLINE) at kernel/sched/idle.c:423
#7  0xffffffff8235c9c7 in rest_init () at init/main.c:747
#8  0xffffffff8419a694 in start_kernel () at init/main.c:1102
#9  0xffffffff841ac6a4 in x86_64_start_reservations (real_mode_data=real_mode_data@entry=0x147b0 <exception_stacks+34736> <error: Cannot access memory at address 0x147b0>) at arch/x86/kernel/head64.c:507
#10 0xffffffff841ac7fd in x86_64_start_kernel (real_mode_data=0x147b0 <exception_stacks+34736> <error: Cannot access memory at address 0x147b0>) at arch/x86/kernel/head64.c:488
#11 0xffffffff8110a38d in secondary_startup_64 () at arch/x86/kernel/head_64.S:414
#12 0x0000000000000000 in ?? ()
(gdb) 

But I haven't ingested your fix yet, so maybe I can't be entirely
surprised if CPU0 scheduled away and ended up in the idle thread?

If I were cleverer I'd remember how to make gdb give me a backtrace for
the process which is actually in the kexec sys_reboot() system call,
instead of the boring idle thread.

(gdb) p sysrq_handle_showstate('t')

That didn't work. Maybe if I'd actually had no_console_suspend on this
boot. Will try again.
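
A sketch of another way to look at this from gdb, assuming the kernel
was built with CONFIG_GDB_SCRIPTS and vmlinux-gdb.py is loaded (both
are assumptions about this setup, not something confirmed above). The
helpers read the printk buffer and the task list directly from guest
memory, so a suspended console should not matter to them:

(gdb) lx-dmesg
(gdb) lx-ps

lx-ps only lists tasks (address, pid, comm); it does not by itself
produce a backtrace for a sleeping task.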