Re: disk-io lockup in 4.14.13 kernel

Bart Van Assche <Bart.VanAssche@xxxxxxx> · Mon, 26 Mar 2018 22:56:21 +0000

On Sat, 2018-03-24 at 23:38 +0200, Jaco Kroon wrote:
> Does the following go with your theory:
> 
> [452545.945561] sysrq: SysRq : Show backtrace of all active CPUs
> [452545.946182] NMI backtrace for cpu 5
> [452545.946185] CPU: 5 PID: 31921 Comm: bash Tainted: G          I     4.14.13-uls #2
> [452545.946186] Hardware name: Supermicro SSG-5048R-E1CR36L/X10SRH-CLN4F, BIOS T20140520103247 05/20/2014
> [452545.946187] Call Trace:
> [452545.946196]  dump_stack+0x46/0x5a
> [452545.946200]  nmi_cpu_backtrace+0xb3/0xc0
> [452545.946205]  ? irq_force_complete_move+0xd0/0xd0
> [452545.946208]  nmi_trigger_cpumask_backtrace+0x8f/0xc0
> [452545.946212]  __handle_sysrq+0xec/0x140
> [452545.946216]  write_sysrq_trigger+0x26/0x30
> [452545.946219]  proc_reg_write+0x38/0x60
> [452545.946222]  __vfs_write+0x1e/0x130
> [452545.946225]  vfs_write+0xab/0x190
> [452545.946228]  SyS_write+0x3d/0xa0
> [452545.946233]  entry_SYSCALL_64_fastpath+0x13/0x6c
> [452545.946236] RIP: 0033:0x7f6b85db52d0
> [452545.946238] RSP: 002b:00007fff6f9479e8 EFLAGS: 00000246
> [452545.946241] Sending NMI from CPU 5 to CPUs 0-4:
> [452545.946272] NMI backtrace for cpu 0 skipped: idling at pc 0xffffffff8162b0a0
> [452545.946275] NMI backtrace for cpu 3 skipped: idling at pc 0xffffffff8162b0a0
> [452545.946279] NMI backtrace for cpu 4 skipped: idling at pc 0xffffffff8162b0a0
> [452545.946283] NMI backtrace for cpu 2 skipped: idling at pc 0xffffffff8162b0a0
> [452545.946287] NMI backtrace for cpu 1 skipped: idling at pc 0xffffffff8162b0a0
> 
> I'm not sure how to link that address back to some function or
> something, and had to reboot, so not sure if that can be done still.

Hello Jaco,

The above call trace means that SysRq-l was triggered, either via the keyboard
or through procfs. I don't think that there is any information in the above
that reveals the root cause of why a reboot was necessary.
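
On linking an address such as 0xffffffff8162b0a0 back to a function: kernel
text addresses can usually be looked up in the System.map file from the same
kernel build by finding the nearest symbol at or below the address. Below is a
minimal Python sketch of that lookup, assuming input lines of the form
"<hex address> <type> <name>". The helper names load_symbols and resolve are
made up for illustration.

```python
import bisect


def load_symbols(system_map_text):
    """Parse System.map-style lines: '<hex address> <type> <name>'."""
    symbols = []
    for line in system_map_text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return (name, offset) for the nearest symbol at or below address."""
    addresses = [addr for addr, _ in symbols]
    index = bisect.bisect_right(addresses, address) - 1
    if index < 0:
        return None  # address lies below the first known symbol
    base, name = symbols[index]
    return name, address - base
```

Feeding it the contents of the System.map that matches the 4.14.13-uls build,
together with 0xffffffff8162b0a0, should name the function the idle CPUs were
in. With KASLR enabled the addresses in a trace are offset from those in
System.map, so this direct lookup only works when KASLR is off.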

What I do myself to identify the root cause of weird kernel behavior is to
rebuild the kernel with a bunch of debugging options enabled and then try to
reproduce the trigger that caused the weird behavior. If this makes the kernel
debugging code produce additional output, that output can be very helpful for
identifying what is going on. This approach does not always work, however.
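
Which debugging options to enable is not spelled out in this thread. As an
illustration only, the Python sketch below checks a kernel .config for a few
lock and hang debugging options that exist in 4.14-era kernels. The selection
of options and the helper names are assumptions, not a recommended or
complete list.

```python
# Illustrative selection only; not a list taken from this thread.
DEBUG_OPTIONS = [
    "CONFIG_PROVE_LOCKING",
    "CONFIG_DEBUG_ATOMIC_SLEEP",
    "CONFIG_DETECT_HUNG_TASK",
    "CONFIG_SOFTLOCKUP_DETECTOR",
    "CONFIG_DEBUG_LIST",
]


def enabled_options(config_text):
    """Return the set of options set to 'y' or 'm' in .config text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as '# CONFIG_FOO is not set' comments.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value in ("y", "m"):
            enabled.add(name)
    return enabled


def missing_debug_options(config_text, wanted=DEBUG_OPTIONS):
    """List the wanted options that are not enabled in config_text."""
    enabled = enabled_options(config_text)
    return [option for option in wanted if option not in enabled]
```

Running missing_debug_options on the contents of the .config used for the
rebuild shows which of the listed options are still disabled.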

Bart.