Kernel panic 3.18 - 4.0.1

We encountered a kernel panic in our tests. We think it is because we
introduced bind-mounted network namespaces: basically, for each
container we do unshare(CLONE_NEWNET), bind-mount the new namespace to a
path, and then configure it and setns() into it. Here is the trace I get
on 4.0.1:
May 26 13:37:26 minigrind kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000016
May 26 13:37:26 minigrind kernel: IP: [<ffffffff811d4683>] __detach_mounts+0x33/0x80
May 26 13:37:26 minigrind kernel: PGD 31aef9067 PUD 2b5ed8067 PMD 0
May 26 13:37:26 minigrind kernel: Oops: 0000 [#1] PREEMPT SMP
May 26 13:37:26 minigrind kernel: Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 bridge stp llc overlay ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ebtable_nat ebtab
May 26 13:37:26 minigrind kernel: CPU: 0 PID: 4078 Comm: docker Not tainted 4.0.1-gentoo #1
May 26 13:37:26 minigrind kernel: Hardware name: LENOVO 20AQ006HUS/20AQ006HUS, BIOS GJET77WW (2.27 ) 05/20/2014
May 26 13:37:26 minigrind kernel: task: ffff8802b5e39980 ti: ffff88008bfbc000 task.ti: ffff88008bfbc000
May 26 13:37:26 minigrind kernel: RIP: 0010:[<ffffffff811d4683>] [<ffffffff811d4683>] __detach_mounts+0x33/0x80
May 26 13:37:26 minigrind kernel: RSP: 0018:ffff88008bfbfe38  EFLAGS: 00010202
May 26 13:37:26 minigrind kernel: RAX: 000000000000b9b9 RBX: fffffffffffffffe RCX: 00000000000000b9
May 26 13:37:26 minigrind kernel: RDX: ffff8802b5e39980 RSI: ffffffff819a10cd RDI: 0000000000000000
May 26 13:37:26 minigrind kernel: RBP: ffff880327fbe480 R08: 0000000000000000 R09: 0000000000000000
May 26 13:37:26 minigrind kernel: R10: ffff88033e2197e0 R11: 0000000000000000 R12: ffff88007dde8a78
May 26 13:37:26 minigrind kernel: R13: ffff88007dde8ea8 R14: ffff88008bfbfea0 R15: ffff88007dde8f40
May 26 13:37:26 minigrind kernel: FS:  00007f7421b0a700(0000) GS:ffff88033e200000(0000) knlGS:0000000000000000
May 26 13:37:26 minigrind kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 26 13:37:26 minigrind kernel: CR2: 0000000000000016 CR3: 000000031702b000 CR4: 00000000001406f0
May 26 13:37:26 minigrind kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 26 13:37:26 minigrind kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
May 26 13:37:26 minigrind kernel: Stack:
May 26 13:37:26 minigrind kernel:  ffff880327fbe4d8 ffffffff811bfc82 00000000014007f0 00000000fffffffe
May 26 13:37:26 minigrind kernel:  ffff88031724d000 0000000000000000 ffff88008bfbfeb8 ffff88007dde8ea8
May 26 13:37:26 minigrind kernel:  00000000ffffff9c ffffffff811c4ec8 000000c20858d5f0 ffff880327fbe480
May 26 13:37:26 minigrind kernel: Call Trace:
May 26 13:37:26 minigrind kernel:  [<ffffffff811bfc82>] ? vfs_unlink+0x172/0x180
May 26 13:37:26 minigrind kernel:  [<ffffffff811c4ec8>] ? do_unlinkat+0x268/0x2d0
May 26 13:37:26 minigrind kernel:  [<ffffffff8104bdb5>] ? syscall_trace_enter_phase1+0x195/0x1a0
May 26 13:37:26 minigrind kernel:  [<ffffffff81746216>] ? int_check_syscall_exit_work+0x34/0x3d
May 26 13:37:26 minigrind kernel:  [<ffffffff81745ff6>] ? system_call_fastpath+0x16/0x1b
May 26 13:37:26 minigrind kernel: Code: 62 c3 81 e8 b0 fc 56 00 48 89 df e8 18 da ff ff 48 85 c0 48 89 c3 74 55 48 c7 c7 84 b4 c0 81 e8 a4 0f 57 00 83 05 fd 6d a3 00 01 <48> 8b 53 18 48 85 d2
May 26 13:37:26 minigrind kernel: RIP  [<ffffffff811d4683>] __detach_mounts+0x33/0x80
May 26 13:37:26 minigrind kernel:  RSP <ffff88008bfbfe38>
May 26 13:37:26 minigrind kernel: CR2: 0000000000000016
May 26 13:37:26 minigrind kernel: ---[ end trace 399f937a2cba4abb ]---

On 4.0.2 everything works fine for me.
My colleagues got different errors, such as an RCU stall and a plain
deadlock where new namespaces can no longer be created. I think all of
these errors were fixed somewhere in 4.0.2, but I'm not sure where
exactly.
The test which produces the panic (or hang) basically starts 16
containers in parallel, so it is 16 unshares + bind mounts, and then it
unmounts those namespaces.
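For reference, the per-container sequence looks roughly like this. This is a minimal C sketch, not our actual test code: the function names are mine, it assumes root privileges, and error reporting is trimmed.

```c
/* Sketch of the bind-mounted-netns pattern described above.
 * Assumes Linux and CAP_SYS_ADMIN; names like setup_netns are
 * illustrative only. */
#define _GNU_SOURCE
#include <sched.h>
#include <fcntl.h>
#include <sys/mount.h>
#include <unistd.h>

int setup_netns(const char *bind_path)
{
    /* Create the bind target so mount() has something to attach to. */
    int fd = open(bind_path, O_CREAT | O_RDONLY, 0644);
    if (fd < 0)
        return -1;
    close(fd);

    if (unshare(CLONE_NEWNET) < 0)          /* new network namespace */
        return -1;

    /* Pin the namespace by bind-mounting /proc/self/ns/net onto the path. */
    if (mount("/proc/self/ns/net", bind_path, "none", MS_BIND, NULL) < 0)
        return -1;
    return 0;
}

int enter_netns(const char *bind_path)
{
    int fd = open(bind_path, O_RDONLY);
    if (fd < 0)
        return -1;
    int ret = setns(fd, CLONE_NEWNET);      /* join the pinned namespace */
    close(fd);
    return ret;
}

void teardown_netns(const char *bind_path)
{
    umount2(bind_path, MNT_DETACH);  /* drop the pin ... */
    unlink(bind_path);               /* ... then the path; this unlink()
                                        is where the 4.0.1 oops fires
                                        (vfs_unlink -> __detach_mounts) */
}
```

The test runs 16 of these setup/teardown sequences in parallel, which is what seems to trigger the race.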
Also, here is some info from one of my coworkers about the deadlock:

mrjana [10:38 PM]
docker thread:

root@jenkins-prs-7:/proc/8895/task/8931# cat stack
[<ffffffff81466465>] copy_net_ns+0x75/0x150
[<ffffffff8108c3bd>] create_new_namespaces+0xfd/0x1a0
[<ffffffff8108c5ea>] unshare_nsproxy_namespaces+0x5a/0xc0
[<ffffffff8106d1c3>] SyS_unshare+0x183/0x330
[<ffffffff8156df4d>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

mrjana [10:38 PM]
This docker thread is waiting on net_mutex

mrjana [10:38 PM]
which is held by a kworker thread that is not returning:

mrjana [10:39 PM]
here's the stack trace of the kernel thread:

mrjana [10:39 PM]
root@jenkins-prs-7:/proc# cat /proc/6/stack
[<ffffffff810aec15>] mutex_optimistic_spin+0x185/0x1e0
[<ffffffff8147d5c5>] rtnl_lock+0x15/0x20
[<ffffffff8146c7a2>] default_device_exit_batch+0x72/0x160
[<ffffffff81465a83>] ops_exit_list.isra.1+0x53/0x60
[<ffffffff81466320>] cleanup_net+0x100/0x1d0
[<ffffffff81086064>] process_one_work+0x154/0x400
[<ffffffff81086a0b>] worker_thread+0x6b/0x490
[<ffffffff8108b8fb>] kthread+0xdb/0x100
[<ffffffff8156de98>] ret_from_fork+0x58/0x90
[<ffffffffffffffff>] 0xffffffffffffffff

mrjana [10:41 PM]
If you look at the 3.18 code, this thread acquires net_mutex in cleanup_net

mrjana [10:41 PM]
but this kworker thread never releases net_mutex

mrjana [10:41 PM]
instead it is spinning on rtnl_lock
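
Reading the two stacks together, the picture seems to be roughly this (my paraphrase of the traces above, not the literal kernel source):

```
kworker thread (pid 6):                 docker thread (task 8931):
  cleanup_net()
    mutex_lock(&net_mutex)   <- taken
    ops_exit_list()
      default_device_exit_batch()
        rtnl_lock()          <- spinning here, net_mutex still held
                                          SyS_unshare()
                                            copy_net_ns()
                                              mutex_lock(&net_mutex)  <- blocked
```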

On our CI we tried kernel versions 3.18, 3.19 and 4.0.1.

Feel free to ask if you need any additional info, or a machine where you
can reproduce this easily.

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



