Re: v6.12-rc workqueue lockups

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 28 Oct 2024 16:35:35 +1100

On Thu, Oct 24, 2024 at 11:23:17PM +0100, John Garry wrote:
> On 24/10/2024 22:13, Dave Chinner wrote:
> > > > BTW, can you please share logs which would contain the full stacktraces that
> > > > these softlockup reports produce? The attached dmesg is just from a fresh
> > > > boot...  Thanks!
> > > > 
> > > thanks for getting back to me.
> > > 
> > > So I think that enabling /proc/sys/kernel/softlockup_all_cpu_backtrace is
> > > required there. Unfortunately my VM often just locks up without any sign of
> > > life.
> > Attach a "serial" console to the VM - add "console=ttyS0,115200" to
> > the kernel command line and add "-serial pty" to the qemu command
> > line. You can then attach something like minicom to the /dev/pts/X
> > device that qemu creates for the console output and capture
> > everything from initial boot right through to the softlockup traces
> > that are emitted...
> 
> I am using an OCI instance, so I can't change the qemu command line (as far
> as I know).
> 
> For this issue, the Cloud Shell locks up also. There are other console
> connection methods, which I can try.
> 
> BTW, earlier today I got this once when trying to recreate this issue:
> 
> [ 1549.241972] ------------[ cut here ]------------
> [ 1609.240236] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1609.240243] rcu:     5-...!: (0 ticks this GP) idle=a8f4/1/0x4000000000000000 softirq=71287/71287 fqs=1
> [ 1609.240249] rcu:     (detected by 2, t=60004 jiffies, g=168077, q=10823 ncpus=16)
> [ 1609.240252] Sending NMI from CPU 2 to CPUs 5:
> [ 1609.240277] NMI backtrace for cpu 5
> [ 1609.240281] CPU: 5 UID: 1002 PID: 8250 Comm: mysqld Tainted: G W 6.12.0-rc4-g556c97f2ecbf #40
> [ 1609.240286] Tainted: [W]=WARN
> [ 1609.240288] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.5.1 06/16/2021
> [ 1609.240289] RIP: 0010:native_halt+0xe/0x20
> [ 1609.240296] Code: 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 23 f1 17 01 f4 <e9> 28 c3 05 01 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90
> [ 1609.240298] RSP: 0018:ffffc0c8c71dbd20 EFLAGS: 00000046
> [ 1609.240301] RAX: 0000000000000003 RBX: ffff9ff73fab6580 RCX: 0000000000000008
> [ 1609.240303] RDX: ffff9ff7bffaf740 RSI: 0000000000000003 RDI: ffff9ff73fab6580
> [ 1609.240304] RBP: ffff9ff73f8b7440 R08: 0000000000000008 R09: 0000000000000074
> [ 1609.240306] R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
> [ 1609.240307] R13: 0000000000000001 R14: 0000000000000100 R15: 0000000000180000
> [ 1609.240311] FS:  00007f9e12600700(0000) GS:ffff9ff73f880000(0000) knlGS:0000000000000000
> [ 1609.240313] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1609.240315] CR2: 00007f9d63e00004 CR3: 0000001a0bc04005 CR4: 0000000000770ef0
> [ 1609.240319] PKRU: 55555554
> [ 1609.240320] Call Trace:
> [ 1609.240322]  <NMI>
> [ 1609.240325]  ? nmi_cpu_backtrace+0x98/0x110
> [ 1609.240330]  ? nmi_cpu_backtrace_handler+0x11/0x20
> [ 1609.240334]  ? nmi_handle+0x5c/0x150
> [ 1609.240339]  ? default_do_nmi+0x4e/0x120
> [ 1609.240343]  ? exc_nmi+0x137/0x1d0
> [ 1609.240347]  ? end_repeat_nmi+0xf/0x53
> [ 1609.240354]  ? native_halt+0xe/0x20
> [ 1609.240357]  ? native_halt+0xe/0x20
> [ 1609.240360]  ? native_halt+0xe/0x20
> [ 1609.240363]  </NMI>
> [ 1609.240364]  <TASK>
> [ 1609.240366]  kvm_wait+0x47/0x60
> [ 1609.240368]  __pv_queued_spin_lock_slowpath+0x255/0x370
> [ 1609.240373]  _raw_spin_lock+0x29/0x30
> [ 1609.240376]  raw_spin_rq_lock_nested+0x1c/0x80
> [ 1609.240381]  __task_rq_lock+0x3f/0xe0
> [ 1609.240384]  try_to_wake_up+0x3cf/0x640
> [ 1609.240387]  ? plist_del+0x63/0xc0
> [ 1609.240391]  wake_up_q+0x4d/0x90
> [ 1609.240394]  futex_wake+0x154/0x180
> [ 1609.240400]  do_futex+0xf8/0x1d0
> [ 1609.240404]  __x64_sys_futex+0x68/0x1c0
> [ 1609.240407]  ? restore_fpregs_from_fpstate+0x3c/0xa0
> [ 1609.240411]  do_syscall_64+0x62/0x170
> [ 1609.240416]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Yup, I'm seeing random RCU stalls as well when running a 64p
VM under heavy concurrent fstests load. The serial console output is
occasionally tripping RCU stall warnings, too.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx