Re: 5.4.13-rt7 stall on CPU?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello Sebastian, all,

I found another one of these in my logs:

Jul 19 05:33:05 vuurmuur named[2363288]: dispatch 0x7f1e3d7fa6c0:
shutting down due to TCP receive error: 2001:500:200::b#53: connection reset
Jul 20 02:20:24 vuurmuur kernel: 002: rcu: INFO: rcu_preempt
self-detected stall on CPU
Jul 20 02:20:24 vuurmuur kernel: 002: rcu:      2-....: (5250 ticks this
GP) idle=5c6/1/0x4000000000000002 softirq=401431806/401431806 fqs=2372
Jul 20 02:20:24 vuurmuur kernel: 002:   (t=5250 jiffies g=701989125 q=336)
Jul 20 02:20:24 vuurmuur kernel: 002: NMI backtrace for cpu 2
Jul 20 02:20:24 vuurmuur kernel: 002: CPU: 2 PID: 3468730 Comm: ntpd
Tainted: G           O      5.4.13-rt7 #9
Jul 20 02:20:24 vuurmuur kernel: 002: Hardware name: To Be Filled By
O.E.M. To Be Filled By O.E.M./QC5000M-ITX/PH, BIOS P1.10 05/06/2015
Jul 20 02:20:24 vuurmuur kernel: 002: Call Trace:
Jul 20 02:20:24 vuurmuur kernel: 002:  <IRQ>
Jul 20 02:20:24 vuurmuur kernel: 002:  dump_stack+0x50/0x70
Jul 20 02:20:24 vuurmuur kernel: 002:  nmi_cpu_backtrace.cold+0x14/0x53
Jul 20 02:20:24 vuurmuur kernel: 002:  ? lapic_can_unplug_cpu.cold+0x3b/0x3b
Jul 20 02:20:24 vuurmuur kernel: 002:
nmi_trigger_cpumask_backtrace+0x8e/0xa2
Jul 20 02:20:24 vuurmuur kernel: 002:  rcu_dump_cpu_stacks+0x8b/0xb9
Jul 20 02:20:24 vuurmuur kernel: 002:  rcu_sched_clock_irq.cold+0x17e/0x4fd
Jul 20 02:20:24 vuurmuur kernel: 002:  ? account_system_index_time+0xa6/0xd0
Jul 20 02:20:24 vuurmuur kernel: 002:  update_process_times+0x1f/0x50
Jul 20 02:20:24 vuurmuur kernel: 002:  tick_sched_timer+0x5a/0x1c0
Jul 20 02:20:24 vuurmuur kernel: 002:  ?
tick_switch_to_oneshot.cold+0x74/0x74
Jul 20 02:20:24 vuurmuur kernel: 002:  __hrtimer_run_queues+0xba/0x1b0
Jul 20 02:20:24 vuurmuur kernel: 002:  hrtimer_interrupt+0x108/0x230
Jul 20 02:20:24 vuurmuur kernel: 002:  smp_apic_timer_interrupt+0x61/0xa0
Jul 20 02:20:24 vuurmuur kernel: 002:  apic_timer_interrupt+0xf/0x20
Jul 20 02:20:24 vuurmuur kernel: 002:  </IRQ>
Jul 20 02:20:24 vuurmuur kernel: 002: RIP: 0010:lock_release+0x114/0x1a0
Jul 20 02:20:24 vuurmuur kernel: 002: Code: 0a 50 6e 01 75 0e 44 2b 7c
24 08 44 39 bd a8 07 00 00 75 54 65 48 8b 04 25 00 5e 01 00 c7 80 ac 07
00 00 00 00 00 00 41 56 9d <48> 8b 44 24 10 65 48 33 04 25 28 00 00 00
75 73 48 8b 6c 24 18 4c
Jul 20 02:20:24 vuurmuur kernel: 002: RSP: 0018:ffffc90001ebb8f0 EFLAGS:
00000246 ORIG_RAX: ffffffffffffff13
Jul 20 02:20:24 vuurmuur kernel: 002: RAX: ffff888054294140 RBX:
000000000000001f RCX: ffffc90001ebb8fc
Jul 20 02:20:24 vuurmuur kernel: 002: RDX: 0000000000000000 RSI:
ffffffff8203bee0 RDI: ffff8880542948f0
Jul 20 02:20:24 vuurmuur kernel: 002: RBP: ffff888054294140 R08:
0000000000020019 R09: 0000000000000000
Jul 20 02:20:24 vuurmuur kernel: 002: R10: 0000000000000001 R11:
0000000000000000 R12: ffffffff8203bee0
Jul 20 02:20:24 vuurmuur kernel: 002: R13: ffffffff8117b0a2 R14:
0000000000000246 R15: 0000000000000001
Jul 20 02:20:24 vuurmuur kernel: 002:  ? do_select+0x132/0x7a0
Jul 20 02:20:24 vuurmuur kernel: 002:  do_select+0x14f/0x7a0
Jul 20 02:20:24 vuurmuur kernel: 002:  ?
select_estimate_accuracy+0x100/0x100
Jul 20 02:20:24 vuurmuur kernel: 002:  ? poll_select_finish+0x1d0/0x1d0
Jul 20 02:20:24 vuurmuur kernel: 002:  ? poll_select_finish+0x1d0/0x1d0
Jul 20 02:20:24 vuurmuur kernel: 002:  ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: 002:  ? put_prev_task_rt+0x22/0x140
Jul 20 02:20:24 vuurmuur kernel: 002:  ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: 002:  ? __schedule+0x435/0x4f0
Jul 20 02:20:24 vuurmuur kernel: 002:  ? _raw_spin_unlock_irq+0x1f/0x40
Jul 20 02:20:24 vuurmuur kernel: 002:  ? __schedule+0x435/0x4f0
Jul 20 02:20:24 vuurmuur kernel: 002:  ? preempt_schedule_irq+0x31/0x50
Jul 20 02:20:24 vuurmuur kernel: 002:  ? retint_kernel+0x1b/0x1b
Jul 20 02:20:24 vuurmuur kernel: 002:  ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: 002:  ? core_sys_select+0x5c/0x380
Jul 20 02:20:24 vuurmuur kernel: 002:  core_sys_select+0x1d0/0x380
Jul 20 02:20:24 vuurmuur kernel: 002:  ? core_sys_select+0x5c/0x380
Jul 20 02:20:24 vuurmuur kernel: 002:  ? tty_ldisc_ref_wait+0x27/0x70
Jul 20 02:20:24 vuurmuur kernel: 002:  ? __ldsem_down_read_nested+0x5e/0x240
Jul 20 02:20:24 vuurmuur kernel: 002:  ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: 002:  ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: 002:  ? set_user_sigmask+0x62/0x90
Jul 20 02:20:24 vuurmuur kernel: 002:  __x64_sys_pselect6+0x141/0x190
Jul 20 02:20:24 vuurmuur kernel: 002:  ? _raw_spin_unlock_irq+0x1f/0x40
Jul 20 02:20:24 vuurmuur kernel: 002:  ? sigprocmask+0x6d/0x90
Jul 20 02:20:24 vuurmuur kernel: 002:  do_syscall_64+0x77/0x440
Jul 20 02:20:24 vuurmuur kernel: 002:  ? nmi_handle+0xc1/0xe0
Jul 20 02:20:24 vuurmuur kernel: 002:  ? unregister_nmi_handler+0xb0/0xb0
Jul 20 02:20:24 vuurmuur kernel: 002:
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 20 02:20:24 vuurmuur kernel: 002: RIP: 0033:0x7fba5ae5a096
Jul 20 02:20:24 vuurmuur kernel: 002: Code: e8 9f dd f8 ff 4c 8b 0c 24
4c 8b 44 24 08 89 c5 4c 8b 54 24 28 48 8b 54 24 20 b8 0e 01 00 00 48 8b
74 24 18 8b 7c 24 14 0f 05 <48> 3d 00 f0 ff ff 77 28 89 ef 89 04 24 e8
c8 dd f8 ff 8b 04 24 eb
Jul 20 02:20:24 vuurmuur kernel: 002: RSP: 002b:00007ffcff164a10 EFLAGS:
00000293 ORIG_RAX: 000000000000010e
Jul 20 02:20:24 vuurmuur kernel: 002: RAX: ffffffffffffffda RBX:
00005613a62cde78 RCX: 00007fba5ae5a096
Jul 20 02:20:24 vuurmuur kernel: 002: RDX: 0000000000000000 RSI:
00007ffcff164b00 RDI: 000000000000001f
Jul 20 02:20:24 vuurmuur kernel: 002: RBP: 0000000000000000 R08:
0000000000000000 R09: 00007ffcff164a50
Jul 20 02:20:24 vuurmuur kernel: 002: R10: 0000000000000000 R11:
0000000000000293 R12: 00005613a62a9c43
Jul 20 02:20:24 vuurmuur kernel: 002: R13: 0000000000000009 R14:
ffffffffffffffff R15: 00005613a62a9e1c
Jul 20 02:20:24 vuurmuur kernel: rcu: INFO: rcu_preempt self-detected
stall on CPU
Jul 20 02:20:24 vuurmuur kernel: rcu: #0112-....: (5250 ticks this GP)
idle=5c6/1/0x4000000000000002 softirq=401431806/401431806 fqs=2372
Jul 20 02:20:24 vuurmuur kernel: #011(t=5250 jiffies g=701989125 q=336)
Jul 20 02:20:24 vuurmuur kernel: NMI backtrace for cpu 2
Jul 20 02:20:24 vuurmuur kernel: CPU: 2 PID: 3468730 Comm: ntpd Tainted:
G           O      5.4.13-rt7 #9

Jul 20 02:20:24 vuurmuur kernel: Hardware name: To Be Filled By O.E.M.
To Be Filled By O.E.M./QC5000M-ITX/PH, BIOS P1.10 05/06/2015
Jul 20 02:20:24 vuurmuur kernel: Call Trace:
Jul 20 02:20:24 vuurmuur kernel: <IRQ>
Jul 20 02:20:24 vuurmuur kernel: dump_stack+0x50/0x70
Jul 20 02:20:24 vuurmuur kernel: nmi_cpu_backtrace.cold+0x14/0x53
Jul 20 02:20:24 vuurmuur kernel: ? lapic_can_unplug_cpu.cold+0x3b/0x3b
Jul 20 02:20:24 vuurmuur kernel: nmi_trigger_cpumask_backtrace+0x8e/0xa2
Jul 20 02:20:24 vuurmuur kernel: rcu_dump_cpu_stacks+0x8b/0xb9
Jul 20 02:20:24 vuurmuur kernel: rcu_sched_clock_irq.cold+0x17e/0x4fd
Jul 20 02:20:24 vuurmuur kernel: ? account_system_index_time+0xa6/0xd0
Jul 20 02:20:24 vuurmuur kernel: update_process_times+0x1f/0x50
Jul 20 02:20:24 vuurmuur kernel: tick_sched_timer+0x5a/0x1c0
Jul 20 02:20:24 vuurmuur kernel: ? tick_switch_to_oneshot.cold+0x74/0x74
Jul 20 02:20:24 vuurmuur kernel: __hrtimer_run_queues+0xba/0x1b0
Jul 20 02:20:24 vuurmuur kernel: hrtimer_interrupt+0x108/0x230
Jul 20 02:20:24 vuurmuur kernel: smp_apic_timer_interrupt+0x61/0xa0
Jul 20 02:20:24 vuurmuur kernel: apic_timer_interrupt+0xf/0x20
Jul 20 02:20:24 vuurmuur kernel: </IRQ>
Jul 20 02:20:24 vuurmuur kernel: RIP: 0010:lock_release+0x114/0x1a0
Jul 20 02:20:24 vuurmuur kernel: Code: 0a 50 6e 01 75 0e 44 2b 7c 24 08
44 39 bd a8 07 00 00 75 54 65 48 8b 04 25 00 5e 01 00 c7 80 ac 07 00 00
00 00 00 00 41 56 9d <48> 8b 44 24 10 65 48 33 04 25 28 00 00 00 75 73
48 8b 6c 24 18 4c
Jul 20 02:20:24 vuurmuur kernel: RSP: 0018:ffffc90001ebb8f0 EFLAGS:
00000246 ORIG_RAX: ffffffffffffff13
Jul 20 02:20:24 vuurmuur kernel: RAX: ffff888054294140 RBX:
000000000000001f RCX: ffffc90001ebb8fc
Jul 20 02:20:24 vuurmuur kernel: RDX: 0000000000000000 RSI:
ffffffff8203bee0 RDI: ffff8880542948f0
Jul 20 02:20:24 vuurmuur kernel: RBP: ffff888054294140 R08:
0000000000020019 R09: 0000000000000000
Jul 20 02:20:24 vuurmuur kernel: R10: 0000000000000001 R11:
0000000000000000 R12: ffffffff8203bee0
Jul 20 02:20:24 vuurmuur kernel: R13: ffffffff8117b0a2 R14:
0000000000000246 R15: 0000000000000001
Jul 20 02:20:24 vuurmuur kernel: ? do_select+0x132/0x7a0
Jul 20 02:20:24 vuurmuur kernel: do_select+0x14f/0x7a0
Jul 20 02:20:24 vuurmuur kernel: ? select_estimate_accuracy+0x100/0x100
Jul 20 02:20:24 vuurmuur kernel: ? poll_select_finish+0x1d0/0x1d0
Jul 20 02:20:24 vuurmuur kernel: ? poll_select_finish+0x1d0/0x1d0
Jul 20 02:20:24 vuurmuur kernel: ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: ? put_prev_task_rt+0x22/0x140
Jul 20 02:20:24 vuurmuur kernel: ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: ? __schedule+0x435/0x4f0
Jul 20 02:20:24 vuurmuur kernel: ? _raw_spin_unlock_irq+0x1f/0x40
Jul 20 02:20:24 vuurmuur kernel: ? __schedule+0x435/0x4f0
Jul 20 02:20:24 vuurmuur kernel: ? preempt_schedule_irq+0x31/0x50
Jul 20 02:20:24 vuurmuur kernel: ? retint_kernel+0x1b/0x1b
Jul 20 02:20:24 vuurmuur kernel: ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: ? core_sys_select+0x5c/0x380
Jul 20 02:20:24 vuurmuur kernel: core_sys_select+0x1d0/0x380
Jul 20 02:20:24 vuurmuur kernel: ? core_sys_select+0x5c/0x380
Jul 20 02:20:24 vuurmuur kernel: ? tty_ldisc_ref_wait+0x27/0x70
Jul 20 02:20:24 vuurmuur kernel: ? __ldsem_down_read_nested+0x5e/0x240
Jul 20 02:20:24 vuurmuur kernel: ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: ? set_user_sigmask+0x62/0x90
Jul 20 02:20:24 vuurmuur kernel: __x64_sys_pselect6+0x141/0x190
Jul 20 02:20:24 vuurmuur kernel: ? _raw_spin_unlock_irq+0x1f/0x40
Jul 20 02:20:24 vuurmuur kernel: ? sigprocmask+0x6d/0x90
Jul 20 02:20:24 vuurmuur kernel: do_syscall_64+0x77/0x440
Jul 20 02:20:24 vuurmuur kernel: ? nmi_handle+0xc1/0xe0
Jul 20 02:20:24 vuurmuur kernel: ? unregister_nmi_handler+0xb0/0xb0
Jul 20 02:20:24 vuurmuur kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 20 02:20:24 vuurmuur kernel: RIP: 0033:0x7fba5ae5a096
Jul 20 02:20:24 vuurmuur kernel: Code: e8 9f dd f8 ff 4c 8b 0c 24 4c 8b
44 24 08 89 c5 4c 8b 54 24 28 48 8b 54 24 20 b8 0e 01 00 00 48 8b 74 24
18 8b 7c 24 14 0f 05 <48> 3d 00 f0 ff ff 77 28 89 ef 89 04 24 e8 c8 dd
f8 ff 8b 04 24 eb
Jul 20 02:20:24 vuurmuur kernel: RSP: 002b:00007ffcff164a10 EFLAGS:
00000293 ORIG_RAX: 000000000000010e
Jul 20 02:20:24 vuurmuur kernel: RAX: ffffffffffffffda RBX:
00005613a62cde78 RCX: 00007fba5ae5a096
Jul 20 02:20:24 vuurmuur kernel: RDX: 0000000000000000 RSI:
00007ffcff164b00 RDI: 000000000000001f
Jul 20 02:20:24 vuurmuur kernel: RBP: 0000000000000000 R08:
0000000000000000 R09: 00007ffcff164a50
Jul 20 02:20:24 vuurmuur kernel: R10: 0000000000000000 R11:
0000000000000293 R12: 00005613a62a9c43
Jul 20 02:20:24 vuurmuur kernel: R13: 0000000000000009 R14:
ffffffffffffffff R15: 00005613a62a9e1c
Jul 20 05:03:01 vuurmuur named[2363288]: received control channel
command 'flush'

Again it appears to concern ntpd.
Still on 5.4.13-rt7.

Should I boot into a newer kernel to verify the issue is still there?

Kind regards,
Udo


On 07-07-2020 19:47, Sebastian Andrzej Siewior wrote:
> On 2020-07-04 05:43:32 [+0200], Udo van den Heuvel wrote:
>>> I *think* this happened within the loop in __fget_files(). This function
>>> is inlined by __fget_light() and the loop has a RCU-section so it would
>>> make sense.
>>> Do you run something at an elevated priority in the system? 
>>
>> ntpd (ntpsec) elevates itself.
> 
> To the highest possible. Wonderful.
> 
>>> I don't know
>>> what the other part was doing but somehow one of the file descriptors
>>> (network sockets probably) was about to be closed while the other side
>>> tried to poll() on it.
>>
>> Thanks for explaining.
> 
> I'm puzzled. On dup() and otherwise on replacement / closing the file
> pointer is replaced (with an incremented reference count or NULL ) and
> then the reference count of the old file pointer is decremented. So the
> zero count could be observed once and then the new pointer should be
> seen. So it should not loop for seconds. And then, dup2() is only used
> at startup.




[Index of Archives]     [RT Stable]     [Kernel Newbies]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]

  Powered by Linux