We've been seeing variations of the following panic in production

BUG: kernel NULL pointer dereference, address: 0000000000000000
RIP: 0010:ip6_pol_route+0x59/0x7a0
Call Trace:
 <IRQ>
 ? __die+0x78/0xc0
 ? page_fault_oops+0x286/0x380
 ? fib6_table_lookup+0x95/0xf40
 ? exc_page_fault+0x5d/0x110
 ? asm_exc_page_fault+0x22/0x30
 ? ip6_pol_route+0x59/0x7a0
 ? unlink_anon_vmas+0x370/0x370
 fib6_rule_lookup+0x56/0x1b0
 ? update_blocked_averages+0x2c6/0x6a0
 ip6_route_output_flags+0xd2/0x130
 ip6_dst_lookup_tail+0x3b/0x220
 ip6_dst_lookup_flow+0x2c/0x80
 inet6_sk_rebuild_header+0x14c/0x1e0
 ? tcp_release_cb+0x150/0x150
 __tcp_retransmit_skb+0x68/0x6b0
 ? tcp_current_mss+0xca/0x150
 ? tcp_release_cb+0x150/0x150
 tcp_send_loss_probe+0x8e/0x220
 tcp_write_timer+0xbe/0x2d0
 run_timer_softirq+0x272/0x840
 ? hrtimer_interrupt+0x2c9/0x5f0
 ? sched_clock_cpu+0xc/0x170
 irq_exit_rcu+0x171/0x330
 sysvec_apic_timer_interrupt+0x6d/0x80
 </IRQ>
 <TASK>
 asm_sysvec_apic_timer_interrupt+0x16/0x20
RIP: 0010:cpuidle_enter_state+0xe7/0x243

Inspecting the vmcore with drgn, you can see why this is a NULL pointer
dereference:

>>> prog.crashed_thread().stack_trace()[0]
#0 at 0xffffffff810bfa89 (ip6_pol_route+0x59/0x796) in ip6_pol_route at net/ipv6/route.c:2212:40

2212            if (net->ipv6.devconf_all->forwarding == 0)
2213                    strict |= RT6_LOOKUP_F_REACHABLE;

>>> prog.crashed_thread().stack_trace()[0]['net'].ipv6.devconf_all
(struct ipv6_devconf *)0x0

Looking at the socket, you can see that it has been closed:

>>> decode_enum_type_flags(prog.crashed_thread().stack_trace()[11]['sk'].__sk_common.skc_flags, prog.type('enum sock_flags'))
'SOCK_DEAD|SOCK_KEEPOPEN|SOCK_ZAPPED|SOCK_USE_WRITE_QUEUE'
>>> decode_enum_type_flags(1 << prog.crashed_thread().stack_trace()[11]['sk'].__sk_common.skc_state.value_(), prog["TCPF_CLOSE"].type_, bit_numbers=False)
'TCPF_FIN_WAIT1'

This occurs in our container setup where we have an NFS mount that
belongs to the container's network namespace.  On container shutdown our
netns goes away, which sets net->ipv6.devconf_all = NULL, and then we
panic.

In the kernel we're responsible for destroying our sockets when the
network namespace exits, or for holding a reference on the network
namespace for our sockets so this doesn't happen.

Even once we shut down the socket we can still have TCP timers that
fire in the background, hence this panic.  SUNRPC shuts down the socket
and throws away all knowledge of it, but the networking stack is still
doing work on it in the background.

Fix this by grabbing a reference on the network namespace for any TCP
sockets we open.  With this patch I'm able to cycle my 500 node stress
tier over and over again without panicking, whereas previously I was
losing 10-20 nodes every shutdown cycle.

Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
---
 net/sunrpc/xprtsock.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index bb81050c870e..f02387751a94 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2333,6 +2333,7 @@ static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
 
 	if (!transport->inet) {
 		struct sock *sk = sock->sk;
+		struct net *net = sock_net(sk);
 
 		/* Avoid temporary address, they are bad for long-lived
 		 * connections such as NFS mounts.
@@ -2350,7 +2351,26 @@ static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
 		tcp_sock_set_nodelay(sk);
 
 		lock_sock(sk);
+		/*
+		 * Because timers can fire after the socket is shut down, we
+		 * need to hold a reference on the netns for this socket.
+		 */
+		if (!sk->sk_net_refcnt) {
+			if (!maybe_get_net(net)) {
+				release_sock(sk);
+				return -ENOTCONN;
+			}
+			/*
+			 * For kernel sockets we have a tracker put in place
+			 * for the tracing; we need to free it to maintain
+			 * consistent tracking info.
+			 */
+			__netns_tracker_free(net, &sk->ns_tracker, false);
+			sk->sk_net_refcnt = 1;
+			netns_tracker_alloc(net, &sk->ns_tracker, GFP_KERNEL);
+			sock_inuse_add(net, 1);
+		}
 
 		xs_save_old_callbacks(transport, sk);
 
 		sk->sk_user_data = xprt;
-- 
2.43.0
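
P.S. For anyone who wants to repeat the triage above on their own vmcore,
here is the drgn session rolled into a rough script.  This is only a
sketch: the frame indexes 0 and 11 are specific to this particular crash
and trace, and you'd run it with drgn pointed at your own vmcore/vmlinux.

#!/usr/bin/env drgn
# Sketch of the inspection above; frame indexes match this crash only.
from drgn.helpers.common.format import decode_enum_type_flags

trace = prog.crashed_thread().stack_trace()

# The faulting load: net->ipv6.devconf_all is NULL because the netns
# has already been torn down.
print(trace[0]['net'].ipv6.devconf_all)

# The socket the TCP write timer was still working on.
sk = trace[11]['sk']
print(decode_enum_type_flags(sk.__sk_common.skc_flags,
                             prog.type('enum sock_flags')))
print(decode_enum_type_flags(1 << sk.__sk_common.skc_state.value_(),
                             prog['TCPF_CLOSE'].type_, bit_numbers=False))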