Re: Recent spontaneous reboots on multiple machines

Sowmini Varadhan <sowmini.varadhan@xxxxxxxxxx> · Sun, 21 Feb 2016 20:02:01 -0500

On (02/15/16 07:54), Meelis Roos wrote (on sparclinux):
> > > It's getting more strange. I ran 4.4-rc8-00005 for 2-3 weeks nonstop, 
> > > doing git clone and make -j4 in a loop, on both V240 and V440. Worked 
> > > 100% stable.
> > > 
> > > Then I git git pull from kernel.org, tried to compile 4.5-rc1 (or was it 
> > > rc2 already), on the same running 4.4.0-rc8-00005 and it rebooted, on 
> > > both V240 and V440.

Hmm. My experience was a little different than yours but maybe we
are seeing the same thing. 

I get a panic that matches the description in d188ba86dd07a ("xfrm:
add rcu protection to sk->sk_policy[]"), but the panic remains
even after applying that patch, so maybe there is still some
race window that the patch missed (or I'm missing some additional
patches?).

To reproduce the panic on my V440 (sparc sunfire), I fixed up my transparent
proxy env and did a 'git pull' on the test machine (running 4.4.0-rc3+).
The reboot on panic was quite noisy on the serial console, though I
didn't find anything recorded in /var/log/*, and with
kernel.panic = kernel.panic_on_oops = 1 the ssh session just terminates quietly.

Here's what I pulled out from the console noise:

[3816414.196028] Unable to handle kernel paging request at virtual address 77e0000000000000
[3816414.302455] tsk->{mm,active_mm}->context = 0000000000001f95
[3816414.378057] tsk->{mm,active_mm}->pgd = fff000123c040000
   :
[3816414.651546] git(7768): Oops [#1]
[3816414.696158] CPU: 0 PID: 7768 Comm: git Not tainted 4.4.0-rc3-roos-00790-g264a4ac-dirty #29
[3816414.807133] task: fff000123e2a31e0 ti: fff000123e3dc000 task.ti: fff000123e3dc000
[3816414.907887] TSTATE: 0000009911001601 TPC: 00000000007ed400 TNPC: 00000000007ed404 Y: 00000276    Not tainted
[3816415.039484] TPC: <xfrm_selector_match+0x20/0x3a0>
                      :
                      :

It looks like pol is the bad vaddr: when I insert printks in
xfrm_sk_policy_lookup(), I see the following, where pol has the same
value as the faulting address in the oops above:

   dir XFRM_POLICY_OUT  sk fff000123e1aa000 pol 77e0000000000000
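
For anyone who wants to repeat this check on their own console capture,
here is a minimal sketch (not something I ran on the sunfire itself) that
pulls the faulting address out of the oops line and reports which field of
a debug printk line carries the same value. The helper names
fault_address() and matching_fields() are made up for illustration, and the
"name value" layout of the debug line is assumed to be the one shown above.

```python
import re

# Matches the sparc64 oops line quoted above and captures the 64-bit address.
FAULT_RE = re.compile(
    r"Unable to handle kernel paging request at virtual address ([0-9a-f]{16})"
)
# Matches "name <16 hex digits>" pairs such as "sk fff000123e1aa000".
FIELD_RE = re.compile(r"\b(\w+) ([0-9a-f]{16})\b")


def fault_address(console_text):
    """Return the faulting virtual address as a hex string, or None."""
    m = FAULT_RE.search(console_text)
    return m.group(1) if m else None


def matching_fields(debug_line, addr):
    """Return the names of fields in debug_line whose value equals addr."""
    return [name for name, value in FIELD_RE.findall(debug_line) if value == addr]


oops = ("[3816414.196028] Unable to handle kernel paging request "
        "at virtual address 77e0000000000000")
debug = "dir XFRM_POLICY_OUT  sk fff000123e1aa000 pol 77e0000000000000"

addr = fault_address(oops)
print(addr, matching_fields(debug, addr))
```

With the two lines from this mail, only pol matches the faulting address;
sk looks like an ordinary kernel pointer.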

Relevant parts of the stack trace from the console messages are shown below.

 xfrm_sk_policy_lookup+0x30/0xc0
 xfrm_lookup+0x20/0x340
 nf_xfrm_me_harder+0x54/0x120 [nf_nat]
 nf_nat_ipv4_out+0xe0/0x140 [nf_nat_ipv4]
 nf_iterate+0x8c/0xc0
 nf_hook_slow+0x1c/0xe0
 ip_output+0xd4/0x100
 ip_local_out+0x30/0x60
 tcp_v4_send_synack+0x4c/0xa0
 tcp_conn_request+0x934/0x960
 tcp_rcv_state_process+0x1dc/0xee0
 tcp_v4_do_rcv+0x68/0x220
 tcp_v4_rcv+0xb04/0xbc0
 ip_local_deliver_finish+0x114/0x2a0
 ip_local_deliver+0x38/0xe0
 ip_rcv_finish+0x14c/0x380
 ip_rcv+0x26c/0x3e0
 __netif_receive_skb_core+0x7c4/0xb60
 process_backlog+0x70/0x120
 net_rx_action+0x204/0x300
 __do_softirq+0xc4/0x200
 do_softirq_own_stack+0x2c/0x4
  etc.
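
If it helps when comparing traces from different machines, the frames
above can be split into symbol, offset, size and module with a few lines
of Python. This is only a sketch: parse_frames() is a hypothetical helper,
and it assumes the "symbol+0xoff/0xsize [module]" layout seen in this
trace.

```python
import re

# One frame per line: symbol+0xOFFSET/0xSIZE, optionally followed by [module].
FRAME_RE = re.compile(
    r"^\s*(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<mod>\w+)\])?\s*$"
)


def parse_frames(trace_text):
    """Return a list of (symbol, offset, size, module_or_None) tuples."""
    frames = []
    for line in trace_text.splitlines():
        m = FRAME_RE.match(line)
        if m:  # lines that are not frames (e.g. "etc.") are skipped
            frames.append((m.group("sym"),
                           int(m.group("off"), 16),
                           int(m.group("size"), 16),
                           m.group("mod")))
    return frames


trace = """\
 xfrm_sk_policy_lookup+0x30/0xc0
 xfrm_lookup+0x20/0x340
 nf_xfrm_me_harder+0x54/0x120 [nf_nat]
 nf_nat_ipv4_out+0xe0/0x140 [nf_nat_ipv4]
 nf_iterate+0x8c/0xc0
"""

frames = parse_frames(trace)
# Modules that appear on the path into xfrm_lookup().
modules = sorted({mod for _, _, _, mod in frames if mod})
print(modules)
```

For the top of this trace, the only modules involved are nf_nat and
nf_nat_ipv4, i.e. the policy lookup is reached from the NAT output hook.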

Unfortunately I cannot get a crash dump on sunfire, so there is no way to
tell what other kernel threads could potentially be racing with this.

Still looking..

--Sowmini