Re: kernel panics with 4.14.X versions

Guillaume Morin <guillaume@xxxxxxxxxxx> · Mon, 16 Apr 2018 18:06:31 +0200

On 16 Apr 16:40, Jan Kara wrote:
> Can you please run RIP through ./scripts/faddr2line to see where exactly
> are we looping? I expect the loop iterating over marks to notify but better
> be sure.
> 
> How easily can you hit this? Are you able to run debug kernels / inspect
> crash dumps when the issue occurs? Also testing with the latest mainline
> kernel (4.16) would be welcome whether this isn't just an issue with the
> backport of fsnotify fixes from Miklos.

I do have one proper kernel crash dump for one of the lockups we saw

PID: 30407  TASK: ffff9584913b2180  CPU: 8   COMMAND: "python"
 #0 [ffff959cb7883d80] machine_kexec at ffffffff890561ff
 #1 [ffff959cb7883dd8] __crash_kexec at ffffffff890f6dde
 #2 [ffff959cb7883e90] panic at ffffffff89074f03
 #3 [ffff959cb7883f10] watchdog_timer_fn at ffffffff89117388
 #4 [ffff959cb7883f40] __hrtimer_run_queues at ffffffff890dc65c
 #5 [ffff959cb7883f88] hrtimer_interrupt at ffffffff890dcb76
 #6 [ffff959cb7883fd8] smp_apic_timer_interrupt at ffffffff89802f6a
 #7 [ffff959cb7883ff0] apic_timer_interrupt at ffffffff8980227d
--- <IRQ stack> ---
 #8 [ffffafa5c894f880] apic_timer_interrupt at ffffffff8980227d
    [exception RIP: unknown or invalid address]
    RIP: 0000000000000000  RSP: ffffffff8a696820  RFLAGS: 00000002
    RAX: ffff95908f520c20  RBX: 0000000000000000  RCX: 0000000000000000
    RDX: ffff959c83c4d000  RSI: 0000000000000000  RDI: ffffafa5c894f9f8
    RBP: 0000000053411000   R8: 0000000000000000   R9: ffff95908f520c48
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000001000
    R13: 0000000000001000  R14: 0000000000001000  R15: 0000000053410000
    ORIG_RAX: 0000000000000000  CS: 0000  SS: ffffffffffffff10
bt: WARNING: possibly bogus exception frame
 #9 [ffffafa5c894f928] fsnotify at ffffffff892293e7
#10 [ffffafa5c894f9e8] __fsnotify_parent at ffffffff89229686
#11 [ffffafa5c894fa48] __kernel_write at ffffffff891e9962
#12 [ffffafa5c894fa70] dump_emit at ffffffff892445af
#13 [ffffafa5c894faa8] elf_core_dump at ffffffff8923f546
#14 [ffffafa5c894fc60] do_coredump at ffffffff89244c3f
#15 [ffffafa5c894fda0] get_signal at ffffffff89083ed0
#16 [ffffafa5c894fe18] do_signal at ffffffff89028323
#17 [ffffafa5c894ff10] exit_to_usermode_loop at ffffffff8900308c
#18 [ffffafa5c894ff38] prepare_exit_to_usermode at ffffffff89003753
    RIP: 00007f69706935c3  RSP: 00007ffeb8c1b4a8  RFLAGS: 00010206
    RAX: 00007f686d200034  RBX: 00005591f24f0170  RCX: 00007f68cb800000
    RDX: 00007f696d200000  RSI: 0000000000000061  RDI: 00007f686d200034
    RBP: 00007f686d200010   R8: ffffffffffffffff   R9: 00000000000000ff
    R10: 00000000e0a9a400  R11: 0000000000000246  R12: 0000000100000000
    R13: 0000000100000000  R14: 0000000000000000  R15: 0000000000000083
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b

faddr2line gives "fsnotify at fs/notify/fsnotify.c:368" (it's a 4.14.22).  So
it does seem that you were right about the location.

This happens with systemd handling coredumps.  It's using fsnotify to learn
about new dumps.

Note that on this machine, the dumps are on a loop mount:
/dev/loop0 /usr/cores ext4 rw,relatime,data=ordered 0 0

-- 
Guillaume Morin <guillaume@xxxxxxxxxxx>