Cc: netfilter-devel@xxxxxxxxxxxxxxx, I'm afraid I'll need some help for this case. On Sat, Dec 17, 2016 at 09:48:13PM +0200, Denys Fedoryshchenko wrote: > Hi, > > I posted recently several netfilter related crashes, didn't got any answers, > one of them started to happen quite often on loaded NAT (17Gbps), > so after trying endless ways to make it stable, i found out that in > backtrace i can often see timers, and this bug probably appearing on older > releases, > i've seen such backtrace with timer fired for conntrack on them. > I disabled Intel turbo for cpus on this loaded NAT, and voila, panic > disappeared for 2nd day! > * by wrmsr -a 0x1a0 0x4000850089 > I am not sure timers is the reason, but probably turbo creating some > condition for bug. > Re-formatting the stack-trace for easier reference: [28904.162607] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 [28904.163210] IP: [<ffffffffa00ab07d>] nf_ct_add_to_dying_list+0x55/0x61 [nf_conntrack] [28904.163745] PGD 0 [28904.164058] Oops: 0002 [#1] SMP [28904.164323] Modules linked in: nf_nat_pptp nf_nat_proto_gre xt_TCPMSS xt_connmark ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_nat xt_rateest xt_RATEEST nf_conntrack_pptp nf_conntrack_proto_gre xt_CT xt_set xt_hl xt_tcpudp ip_set_hash_net ip_set nfnetlink iptable_raw iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables x_tables netconsole configfs 8021q garp mrp stp llc bonding ixgbe dca [28904.168132] CPU: 27 PID: 0 Comm: swapper/27 Not tainted 4.8.14-build-0124 #2 [28904.168398] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.1008.031920151331 03/19/2015 [28904.168853] task: ffff885fa42e8c40 task.stack: ffff885fa42f0000 [28904.169114] RIP: 0010:[<ffffffffa00ab07d>] [<ffffffffa00ab07d>] nf_ct_add_to_dying_list+0x55/0x61 [nf_conntrack] [28904.169643] RSP: 0018:ffff885fbccc3dd8 EFLAGS: 00010246 [28904.169901] RAX: 0000000000000000 RBX: ffff885fbccc0000 RCX: ffff885fbccc0010 [28904.170169] RDX: ffff885f87a1c150 RSI: 0000000000000142 RDI: ffff885fbccc0000 [28904.170437] RBP: ffff885fbccc3de8 R08: 00000000cbdee177 R09: 0000000000000100 [28904.170704] R10: ffff885fbccc3dd0 R11: ffffffff820050c0 R12: ffff885f87a1c140 [28904.170971] R13: 000000000005d948 R14: 00000000000ea942 R15: ffff885f87a1c160 [28904.171237] FS: 0000000000000000(0000) GS:ffff885fbccc0000(0000) knlGS:0000000000000000 [28904.171688] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [28904.171964] CR2: 0000000000000008 CR3: 000000607f006000 CR4: 00000000001406e0 [28904.172231] Stack: [28904.172482] ffff885f87a1c140 ffffffff820a1405 ffff885fbccc3e28 ffffffffa00abb30 [28904.173182] 00000002820a1405 ffff885f87a1c140 ffff885f99a28201 0000000000000000 [28904.173884] 0000000000000000 ffffffff820050c8 ffff885fbccc3e58 ffffffffa00abc62 [28904.174585] Call Trace: [28904.174835] <IRQ> [28904.174912] [<ffffffffa00abb30>] nf_ct_delete_from_lists+0xc9/0xf2 [nf_conntrack] [28904.175613] [<ffffffffa00abc62>] nf_ct_delete+0x109/0x12c [nf_conntrack] [28904.175894] [<ffffffffa00abc85>] ? nf_ct_delete+0x12c/0x12c [nf_conntrack] [28904.176169] [<ffffffffa00abc92>] death_by_timeout+0xd/0xf [nf_conntrack] [28904.176443] [<ffffffff81109922>] call_timer_fn.isra.5+0x17/0x6b [28904.176714] [<ffffffff811099e5>] expire_timers+0x6f/0x7e [28904.176975] [<ffffffff81109add>] run_timer_softirq+0x69/0x8b [28904.177238] [<ffffffff811141bb>] ? clockevents_program_event+0xd0/0xe8 [28904.177504] [<ffffffff810d000c>] __do_softirq+0xbd/0x1aa [28904.177765] [<ffffffff810d0240>] irq_exit+0x37/0x7c [28904.178026] [<ffffffff8102c519>] smp_trace_apic_timer_interrupt+0x7b/0x88 [28904.178300] [<ffffffff8102c52f>] smp_apic_timer_interrupt+0x9/0xb [28904.178565] [<ffffffff818f4f1c>] apic_timer_interrupt+0x7c/0x90 [28904.178835] <EOI> [28904.178907] [<ffffffff8101b929>] ? mwait_idle+0x64/0x7a [28904.179436] [<ffffffff810e3d8f>] ? atomic_notifier_call_chain+0x13/0x15 [28904.179712] [<ffffffff8101bd04>] arch_cpu_idle+0xa/0xc [28904.179976] [<ffffffff810f7c3d>] default_idle_call+0x27/0x29 [28904.180244] [<ffffffff810f7d5c>] cpu_startup_entry+0x11d/0x1c7 [28904.180508] [<ffffffff8102af13>] start_secondary+0xe8/0xeb [28904.180767] Code: 80 2f 0b 82 48 89 df e8 da 90 84 e1 48 8b 43 10 49 8d 54 24 10 48 8d 4b 10 49 89 4c 24 18 a8 01 49 89 44 24 10 48 89 53 10 75 04 <89> 50 08 c6 03 00 5b 41 5c 5d c3 48 8b 05 10 be 00 00 89 f6 [28904.185546] RIP [<ffffffffa00ab07d>] nf_ct_add_to_dying_list+0x55/0x61 [nf_conntrack] [28904.186065] RSP <ffff885fbccc3dd8> [28904.186319] CR2: 0000000000000008 [28904.186593] ---[ end trace 35cbc6c885a5c2d8 ]--- [28904.186860] Kernel panic - not syncing: Fatal exception in interrupt [28904.187155] Kernel Offset: disabled [28904.187419] Rebooting in 5 seconds.. [28909.193662] ACPI MEMORY or I/O RESET_REG. And here's decodecode's output: All code ======== 0: 80 2f 0b subb $0xb,(%rdi) 3: 82 (bad) 4: 48 89 df mov %rbx,%rdi 7: e8 da 90 84 e1 callq 0xffffffffe18490e6 c: 48 8b 43 10 mov 0x10(%rbx),%rax 10: 49 8d 54 24 10 lea 0x10(%r12),%rdx 15: 48 8d 4b 10 lea 0x10(%rbx),%rcx 19: 49 89 4c 24 18 mov %rcx,0x18(%r12) 1e: a8 01 test $0x1,%al 20: 49 89 44 24 10 mov %rax,0x10(%r12) 25: 48 89 53 10 mov %rdx,0x10(%rbx) 29: 75 04 jne 0x2f 2b:* 89 50 08 mov %edx,0x8(%rax) <-- trapping instruction 2e: c6 03 00 movb $0x0,(%rbx) 31: 5b pop %rbx 32: 41 5c pop %r12 34: 5d pop %rbp 35: c3 retq 36: 48 8b 05 10 be 00 00 mov 0xbe10(%rip),%rax # 0xbe4d 3d: 89 f6 mov %esi,%esi Fortunately, nf_ct_add_to_dying_list() is small: static void nf_ct_add_to_dying_list(struct nf_conn *ct) { struct ct_pcpu *pcpu; /* add this conntrack to the (per cpu) dying list */ ct->cpu = smp_processor_id(); pcpu = per_cpu_ptr(nf_ct_net(ct)->ct.pcpu_lists, ct->cpu); spin_lock(&pcpu->lock); hlist_nulls_add_head(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode, &pcpu->dying); spin_unlock(&pcpu->lock); } So "callq 0xffffffffe18490e6" probably refers to the "spin_lock(&pcpu->lock)" call, which means RBX == pcpu. Considering the offsets used, we certainly have R12 == ct. >From there, we can interpret the assembly code as follow: Original register values: rbx: pcpu 0x10(%rbx): &pcpu->dying.first r12: ct 0x10(%r12): &ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode.next 0x18(%r12): &ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode.pprev Code: mov %rbx,%rdi ; RBX == pcpu callq 0xffffffffe18490e6 ; spin_lock(&pcpu->lock) mov 0x10(%rbx),%rax ; rax = pcpu->dying.first lea 0x10(%r12),%rdx ; rdx = &ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode.next lea 0x10(%rbx),%rcx ; rcx = &pcpu->dying.first mov %rcx,0x18(%r12) ; ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode.pprev = &pcpu->dying.first test $0x1,%al ; test: is_a_nulls(pcpu->dying.first) mov %rax,0x10(%r12) ; &ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode.next = pcpu->dying.first mov %rdx,0x10(%rbx) ; pcpu->dying.first = &ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode jne 0x2f ; jump if is_a_nulls(pcpu->dying.first) mov %edx,0x8(%rax) <-- trapping instruction, RAX == 0 So RAX, which has been set to pcpu->dying.first, is NULL. The pcpu->dying list is thus corrupted. I'm not familiar with nf_conntrack_core.c and I still can't figure out how we could end up with pcpu->dying.first being a NULL pointer. Probably there's a race somewhere and activating turbo mode makes it more likely to happen. But still, I'd expect a locking or RCU issue to give us garbage here, not a plain NULL pointer. Does anyone have a clue? -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html