On 2017-08-18 14:44, Sebastian Andrzej Siewior wrote:
> On 2017-08-14 20:03:26 [+0200], Jacek Konieczny wrote:
>> On 2017-08-07 15:19, Sebastian Andrzej Siewior wrote:
>>> One thing you could try is to see if the latest v4.11-based RT kernel
>>> works more reliably.
>>
>> I tried 4.11.12-rt9 (with your patch applied, just in case). It worked
>> fine for a day and then failed like the older kernels:
>
> I see. Could you try to enable lockdep and see if it yells in dmesg?
> lockdep would be:
> CONFIG_DEBUG_RT_MUTEXES=y
> CONFIG_PROVE_LOCKING=y
> CONFIG_DEBUG_ATOMIC_SLEEP=y

Sure. I have recompiled the kernel with those settings and got this:

======================================================
[ INFO: possible circular locking dependency detected ]
4.11.12-rt9-1 #3 Not tainted
-------------------------------------------------------
zabbix_agentd/1299 is trying to acquire lock:
 (&per_cpu(local_softirq_locks[i], __cpu).lock){+.+...}, at: [<ffffffff8a06e1ed>] do_current_softirqs+0x14d/0x670

but task is already holding lock:
 ((tcp_sk_lock).lock){+.+...}, at: [<ffffffff8a65eab1>] tcp_v4_send_reset+0x3b1/0x7d0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 ((tcp_sk_lock).lock){+.+...}:
       lock_acquire+0xb7/0x250
       rt_spin_lock+0x4b/0x60
       tcp_v4_send_reset+0x3b1/0x7d0
       tcp_v4_rcv+0x7c0/0xfb0
       ip_local_deliver_finish+0xe4/0x3d0
       ip_local_deliver+0x1a7/0x220
       ip_rcv_finish+0x222/0x6e0
       ip_rcv+0x3aa/0x540
       __netif_receive_skb_core+0x790/0xdc0
       __netif_receive_skb+0x1d/0x60
       process_backlog+0x9f/0x270
       net_rx_action+0x389/0x6c0
       do_current_softirqs+0x22e/0x670
       __local_bh_enable+0x5b/0x80
       ip_finish_output2+0x2aa/0x5e0
       ip_finish_output+0x229/0x320
       ip_output+0x182/0x260
       ip_local_out+0x39/0x70
       ip_queue_xmit+0x1e8/0x5e0
       tcp_transmit_skb+0x4ce/0x9e0
       tcp_connect+0x658/0x9f0
       tcp_v4_connect+0x56e/0x5b0
       __inet_stream_connect+0xb7/0x320
       inet_stream_connect+0x3b/0x60
       SyS_connect+0xe1/0x120
       do_syscall_64+0x7f/0x210
       return_from_SYSCALL_64+0x0/0x7a

-> #0 (&per_cpu(local_softirq_locks[i], __cpu).lock){+.+...}:
       __lock_acquire+0x1b84/0x1d30
       lock_acquire+0xb7/0x250
       rt_spin_lock+0x4b/0x60
       do_current_softirqs+0x14d/0x670
       __local_bh_enable+0x5b/0x80
       tcp_v4_send_reset+0x48b/0x7d0
       tcp_v4_do_rcv+0x73/0x200
       __release_sock+0x86/0x160
       release_sock+0x35/0xc0
       inet_shutdown+0x86/0x100
       SyS_shutdown+0x84/0x90
       do_syscall_64+0x7f/0x210
       return_from_SYSCALL_64+0x0/0x7a

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock((tcp_sk_lock).lock);
                               lock(&per_cpu(local_softirq_locks[i], __cpu).lock);
                               lock((tcp_sk_lock).lock);
  lock(&per_cpu(local_softirq_locks[i], __cpu).lock);

 *** DEADLOCK ***

3 locks held by zabbix_agentd/1299:
 #0:  (sk_lock-AF_INET){+.+.+.}, at: [<ffffffff8a67d45b>] inet_shutdown+0x3b/0x100
 #1:  (rcu_read_lock){......}, at: [<ffffffff8a65e834>] tcp_v4_send_reset+0x134/0x7d0
 #2:  ((tcp_sk_lock).lock){+.+...}, at: [<ffffffff8a65eab1>] tcp_v4_send_reset+0x3b1/0x7d0

stack backtrace:
CPU: 1 PID: 1299 Comm: zabbix_agentd Not tainted 4.11.12-rt9-1 #3
Hardware name: Acer Aspire E5-575/Ironman_SK , BIOS V1.25 03/03/2017
Call Trace:
 dump_stack+0x68/0x92
 print_circular_bug+0x1f6/0x300
 __lock_acquire+0x1b84/0x1d30
 ? preempt_count_sub+0xa1/0x100
 lock_acquire+0xb7/0x250
 ? lock_acquire+0xb7/0x250
 ? do_current_softirqs+0x14d/0x670
 rt_spin_lock+0x4b/0x60
 ? do_current_softirqs+0x14d/0x670
 do_current_softirqs+0x14d/0x670
 ? __local_bh_enable+0x23/0x80
 __local_bh_enable+0x5b/0x80
 tcp_v4_send_reset+0x48b/0x7d0
 ? tcp_rcv_state_process+0x28c/0xf20
 tcp_v4_do_rcv+0x73/0x200
 ? tcp_v4_do_rcv+0x73/0x200
 __release_sock+0x86/0x160
 release_sock+0x35/0xc0
 inet_shutdown+0x86/0x100
 SyS_shutdown+0x84/0x90
 do_syscall_64+0x7f/0x210
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7fbaa1dbbb57
RSP: 002b:00007ffd750de028 EFLAGS: 00000202 ORIG_RAX: 0000000000000030
RAX: ffffffffffffffda RBX: 00007ffd750de0b0 RCX: 00007fbaa1dbbb57
RDX: 00000000ffffffff RSI: 0000000000000002 RDI: 0000000000000007
RBP: 00000000006671f8 R08: 0000000000995de0 R09: 0000000000000000
R10: 00007fbaa206ebe8 R11: 0000000000000202 R12: 000000000044afb0
R13: 00000000006607dc R14: 00000000006615e0 R15: 00007ffd750de070

Does this help? The system has not crashed yet, so I may catch something more.

Jacek
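
For anyone not used to reading these reports: the two chains above take the
same pair of locks in opposite orders. The shutdown path (chain #0) holds
tcp_sk_lock and then wants the per-CPU local_softirq_locks entry, while the
connect/receive path (chain #1) holds the softirq lock and then takes
tcp_sk_lock; that is the classic AB-BA inversion lockdep warns about. Below
is a minimal, purely illustrative userspace sketch of that pattern using
pthread mutexes. The names lock_a, lock_b, path_ab and path_ba are invented
for the example and do not correspond to anything in the kernel sources.

/*
 * Illustrative sketch only, not kernel code: two threads taking the same
 * pair of mutexes in opposite orders, the AB-BA inversion flagged above.
 * Here lock_a loosely stands in for tcp_sk_lock and lock_b for the per-CPU
 * local_softirq_locks entry; both names are made up for this example.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/* Path 1: A then B (like tcp_sk_lock -> softirq lock in chain #0). */
static void *path_ab(void *unused)
{
	pthread_mutex_lock(&lock_a);
	usleep(1000);                  /* widen the race window */
	pthread_mutex_lock(&lock_b);   /* may block forever against path_ba */
	pthread_mutex_unlock(&lock_b);
	pthread_mutex_unlock(&lock_a);
	return NULL;
}

/* Path 2: B then A (like softirq lock -> tcp_sk_lock in chain #1). */
static void *path_ba(void *unused)
{
	pthread_mutex_lock(&lock_b);
	usleep(1000);
	pthread_mutex_lock(&lock_a);   /* may block forever against path_ab */
	pthread_mutex_unlock(&lock_a);
	pthread_mutex_unlock(&lock_b);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, path_ab, NULL);
	pthread_create(&t2, NULL, path_ba, NULL);

	/* If both threads win their first lock, these joins never return. */
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	puts("no deadlock this time");
	return 0;
}

Built with "gcc -pthread", this will sometimes hang with each thread holding
one mutex and waiting for the other, which is exactly the situation lockdep
tries to report before it can actually happen.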