4.19.x kernels oops in nf_conncount_destroy

"Todd Eigenschink" <todd@xxxxxxxx> · Wed, 28 Nov 2018 01:08:12 -0500

EPILOGUE-AS-PREAMBLE:

I had already typed most of this when I thought to search the
netfilter-devel archive. I found this, which sounds an awful lot like
my issue:

https://www.spinics.net/lists/netfilter-devel/msg56882.html

However, the patch link in the first followup seems empty, so I can't
verify that it's the same thing or that the proposed fix works for me.

----------------------------------------------------------------------

[1.] One line summary of the problem:

4.19.x kernels oops in nf_conncount_destroy.

[2.] Full description of the problem/report:

We have been running 4.18.x kernels, up through 4.18.20, in production
for a small web/email hosting operation with no issues. Everything
relevant here is 32-bit Linux on VMware ESXi. When 4.18.20 was
released, knowing that 4.18.x was EOL, I moved to the then-current
4.19.4.

One of our machines (a mail gateway) hung with an oops within a minute
or two of boot. I rolled it back to deal with later.

The next morning, another machine (coincidentally another mail
gateway) crashed as well, and the tail end of the oops--that I could
see on the 80x25 console--looked similar to what I remembered from the
first. I rolled it back. If a third one happened, I was going to roll
them all back. No other machines had issues.

When 4.19.5 was released, I tried that, with the same effect. Since
the fastest-crashing machine, although in production, would not cause
user-visible issues if it went down, I decided to bisect on it to hunt
down the cause. Every other machine, about 30 in total, has been fine
on 4.19.4 / 4.19.5.
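
For anyone wanting to repeat the bisect, the loop can be sketched as
below. This is a minimal sketch, not the exact commands used: the
v4.18 / v4.19 endpoints and the helper names (verdict, bisect_begin,
bisect_mark) are assumptions for illustration. Each step still needs
a kernel build, install and test boot in between.

```shell
#!/bin/sh
# Sketch of a manual git-bisect loop, run from a kernel git tree.
# GOOD and BAD are ASSUMED endpoints; adjust to the range actually tested.
GOOD=${GOOD:-v4.18}
BAD=${BAD:-v4.19}

# Map the result of a test boot to a git-bisect verdict:
#   "oops" -> bad, "ok" -> good, anything else -> skip
#   (skip covers commits that fail to build or boot for other reasons).
verdict() {
    case "$1" in
        oops) echo bad ;;
        ok)   echo good ;;
        *)    echo skip ;;
    esac
}

# Start the bisect between the known-bad and known-good revisions.
bisect_begin() {
    git bisect start "$BAD" "$GOOD"
}

# Record the outcome of the boot that was just tested.
# Usage: bisect_mark oops | bisect_mark ok | bisect_mark unknown
bisect_mark() {
    git bisect "$(verdict "$1")"
}
```

Call bisect_begin once, then build and boot the checked-out revision,
wait a few minutes, and call bisect_mark with what was observed. Repeat
until git reports the first bad commit; "git bisect log" keeps a record
and "git bisect reset" returns to the original branch.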

Bisecting led me to this:

5c789e131cbb997a528451564ea4613e812fc718 is the first bad commit
commit 5c789e131cbb997a528451564ea4613e812fc718
Author: Yi-Hung Wei <yihung.wei@xxxxxxxxx>
Date:   Mon Jul 2 17:33:44 2018 -0700

    netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search

    This patch is originally from Florian Westphal.

    This patch does the following 3 main tasks.

    1) Add list lock to 'struct nf_conncount_list' so that we can
    alter the lists containing the individual connections without holding the
    main tree lock.  It would be useful when we only need to add/remove to/from
    a list without allocate/remove a node in the tree.  With this change, we
    update nft_connlimit accordingly since we longer need to maintain
    a list lock in nft_connlimit now.

    2) Use RCU for the initial tree search to improve tree look up performance.

    3) Add a garbage collection worker. This worker is schedule when there
    are excessive tree node that needed to be recycled.

    Moreover,the rbnode reclaim logic is moved from search tree to insert tree
    to avoid race condition.

    Signed-off-by: Yi-Hung Wei <yihung.wei@xxxxxxxxx>
    Signed-off-by: Florian Westphal <fw@xxxxxxxxx>
    Signed-off-by: Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx>

:040000 040000 3117a9e5f5d91c55bfcb495ed0cf20aac47beb4c eb16c3c84edfa70268c651490dd5031a6474ca2d M	include
:040000 040000 f69622ea9603500bc837f6348bc7ffe6e4edefda 8983dc24192abb1ae1925f023a495c39d171021c M	net

And it makes perfect sense: Our only two machines that use
nf_connlimit in their firewall configs are those two mail gateways. I
imagine that the speed at which they oops has to do with their
specific connlimit settings and how quickly they accumulate enough
traffic to hit one of them.

Oops details are below.

[3.] Keywords (i.e., modules, networking, kernel):

netfilter, nf_conncount, nf_connlimit

[4.] Kernel information

[4.1.] Kernel version (from /proc/version):

[4.2.] Kernel .config file:

Output of "grep = .config", net-related options only:

CONFIG_NET=y
CONFIG_NET_INGRESS=y
CONFIG_PACKET=y
CONFIG_UNIX=y
CONFIG_XFRM=y
CONFIG_XFRM_ALGO=y
CONFIG_XFRM_USER=y
CONFIG_XFRM_SUB_POLICY=y
CONFIG_XFRM_IPCOMP=m
CONFIG_NET_KEY=m
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_INET_AH=m
CONFIG_INET_ESP=m
CONFIG_INET_IPCOMP=m
CONFIG_INET_XFRM_TUNNEL=m
CONFIG_INET_TUNNEL=m
CONFIG_INET_XFRM_MODE_TRANSPORT=m
CONFIG_INET_XFRM_MODE_TUNNEL=m
CONFIG_INET_XFRM_MODE_BEET=m
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_NET_PTP_CLASSIFY=y
CONFIG_NETFILTER=y
CONFIG_NETFILTER_ADVANCED=y
CONFIG_NETFILTER_INGRESS=y
CONFIG_NETFILTER_NETLINK=y
CONFIG_NETFILTER_FAMILY_ARP=y
CONFIG_NF_CONNTRACK=y
CONFIG_NF_LOG_COMMON=y
CONFIG_NETFILTER_CONNCOUNT=y
CONFIG_NF_CONNTRACK_MARK=y
CONFIG_NF_CONNTRACK_PROCFS=y
CONFIG_NF_CONNTRACK_TIMEOUT=y
CONFIG_NF_CONNTRACK_FTP=y
CONFIG_NF_CT_NETLINK=y
CONFIG_NF_CT_NETLINK_TIMEOUT=y
CONFIG_NF_NAT=y
CONFIG_NF_NAT_NEEDED=y
CONFIG_NF_NAT_FTP=y
CONFIG_NF_NAT_REDIRECT=y
CONFIG_NF_TABLES=y
CONFIG_NFT_CT=y
CONFIG_NFT_CONNLIMIT=y
CONFIG_NFT_LOG=y
CONFIG_NFT_LIMIT=y
CONFIG_NFT_MASQ=y
CONFIG_NFT_NAT=y
CONFIG_NFT_REJECT=y
CONFIG_NF_FLOW_TABLE=m
CONFIG_NETFILTER_XTABLES=y
CONFIG_NETFILTER_XT_MARK=y
CONFIG_NETFILTER_XT_CONNMARK=y
CONFIG_NETFILTER_XT_TARGET_CONNMARK=y
CONFIG_NETFILTER_XT_TARGET_LOG=y
CONFIG_NETFILTER_XT_TARGET_MARK=y
CONFIG_NETFILTER_XT_NAT=y
CONFIG_NETFILTER_XT_TARGET_NETMAP=y
CONFIG_NETFILTER_XT_TARGET_REDIRECT=y
CONFIG_NETFILTER_XT_TARGET_TPROXY=m
CONFIG_NETFILTER_XT_MATCH_COMMENT=y
CONFIG_NETFILTER_XT_MATCH_CONNLIMIT=y
CONFIG_NETFILTER_XT_MATCH_CONNMARK=y
CONFIG_NETFILTER_XT_MATCH_CONNTRACK=y
CONFIG_NETFILTER_XT_MATCH_ESP=m
CONFIG_NETFILTER_XT_MATCH_HASHLIMIT=m
CONFIG_NETFILTER_XT_MATCH_HELPER=y
CONFIG_NETFILTER_XT_MATCH_IPRANGE=m
CONFIG_NETFILTER_XT_MATCH_LENGTH=y
CONFIG_NETFILTER_XT_MATCH_LIMIT=y
CONFIG_NETFILTER_XT_MATCH_MARK=y
CONFIG_NETFILTER_XT_MATCH_MULTIPORT=m
CONFIG_NETFILTER_XT_MATCH_POLICY=y
CONFIG_NETFILTER_XT_MATCH_STATE=y
CONFIG_NETFILTER_XT_MATCH_STATISTIC=m
CONFIG_NETFILTER_XT_MATCH_STRING=m
CONFIG_NETFILTER_XT_MATCH_TCPMSS=m
CONFIG_NF_DEFRAG_IPV4=y
CONFIG_NF_CONNTRACK_IPV4=y
CONFIG_NF_TPROXY_IPV4=m
CONFIG_NF_TABLES_IPV4=y
CONFIG_NFT_CHAIN_ROUTE_IPV4=y
CONFIG_NFT_REJECT_IPV4=y
CONFIG_NF_TABLES_ARP=y
CONFIG_NF_FLOW_TABLE_IPV4=m
CONFIG_NF_LOG_IPV4=y
CONFIG_NF_REJECT_IPV4=y
CONFIG_NF_NAT_IPV4=y
CONFIG_NFT_CHAIN_NAT_IPV4=y
CONFIG_NF_NAT_MASQUERADE_IPV4=y
CONFIG_NFT_MASQ_IPV4=y
CONFIG_IP_NF_IPTABLES=y
CONFIG_IP_NF_FILTER=y
CONFIG_IP_NF_TARGET_REJECT=y
CONFIG_IP_NF_NAT=y
CONFIG_IP_NF_TARGET_MASQUERADE=y
CONFIG_IP_NF_TARGET_NETMAP=y
CONFIG_IP_NF_TARGET_REDIRECT=y
CONFIG_IP_NF_MANGLE=y
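
Of the options above, the ones tied to nf_conncount are
CONFIG_NETFILTER_CONNCOUNT and its two users in this config,
CONFIG_NETFILTER_XT_MATCH_CONNLIMIT and CONFIG_NFT_CONNLIMIT. A minimal
sketch for pulling just those out of a .config follows; the
conncount_opts helper name is made up for illustration.

```shell
#!/bin/sh
# Print the nf_conncount-related options that are set in a kernel .config.
# Usage: conncount_opts [path-to-config]   (defaults to ./.config)
# The helper name is hypothetical; the option names come from the list above.
conncount_opts() {
    grep -E '^CONFIG_(NETFILTER_CONNCOUNT|NETFILTER_XT_MATCH_CONNLIMIT|NFT_CONNLIMIT)=' \
        "${1:-.config}"
}
```

Running "conncount_opts /path/to/.config" on another machine's config
is a quick way to see whether that build can hit the same code path.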

[5.] Most recent kernel version which did not have the bug:

4.18.x is fine. 4.19+ all have it.

[6.] Output of Oops.. message (if applicable) with symbolic information
     resolved (see Documentation/admin-guide/bug-hunting.rst)

For most oopses, all I have is the tail 80x25 of the output since I
can't scroll the console back. A lot of them had call traces that
included bits like:

EIP: native_safe_halt+0x5/0x7
[...]
 ? siphash_3u64+[...]
 default_idle+[...]
 arch_cpu_idle+[...]
 [...]

as well as some IRQ stuff, which really made no sense to me. Then, one
or two bisect steps from the end, I had one that didn't lock up the
machine, so I could scroll back:

BUG: unable to handle kernel NULL pointer dereference at 00000000
*pdpt = 00000000712fd001 *pde = 0000000000000000
Oops: 0000 [#1] SMP
CPU: 1 PID: 26422 Comm: iptables Not tainted 4.18.0-rc3-00851-ged07d9a021df #22
Hardware name: VMware, Inc. VMware Virtual Platform/400BX Desktop Reference Platform, BIOS 6.00 09/30/2014
EIP: nf_conncount_destroy+0x4d/0xa5
Code: ed 4c ff ff 89 f8 83 c7 04 05 04 04 00 00 89 45 ec 89 f8 [...]
[...]
Call Trace:
 connlimit_mt_destroy+0x14/0x16
 cleanup_match+0x34/0x52
 cleanup_entry+0x2e/0x8b
 do_ipt_set_ctl+0x412/0x48e
 ? do_ipt_get_ctl+0x39e/0x39e
 nf_setsockopt+0x37/0x57
 ip_setsockopt+0x4b/0x5a
 [and so on back to entry_SYSENTER_32]

Complete screen shots are available if they'll be of any use.
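
The EIP line above can be turned into a source file and line number
with scripts/faddr2line from the kernel tree. A minimal sketch,
assuming an unstripped vmlinux from the same build (with
CONFIG_DEBUG_INFO=y) is still around; the oops_symbol and resolve
helper names are made up for illustration.

```shell
#!/bin/sh
# Pull "symbol+offset/size" out of a 32-bit oops EIP line read on stdin.
# Lines that are not EIP lines produce no output.
oops_symbol() {
    sed -n 's|^EIP: \([A-Za-z0-9_.]*[+]0x[0-9a-f]*/0x[0-9a-f]*\).*|\1|p'
}

# Resolve a symbol+offset/size string to file:line.
# ASSUMPTIONS: run from the top of the kernel source tree, and the
# vmlinux given (default ./vmlinux) matches the kernel that oopsed.
# Usage: resolve nf_conncount_destroy+0x4d/0xa5 [path-to-vmlinux]
resolve() {
    ./scripts/faddr2line "${2:-vmlinux}" "$1"
}
```

For example, feeding the saved oops text through oops_symbol and
passing the result to resolve should point at the statement inside
nf_conncount_destroy where the NULL dereference happened.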

[7.] A small shell script or example program which triggers the
     problem (if possible)

When I saw "conncount", and knowing that it was our two mail gateways,
my thoughts (above) jumped to our connlimit settings.

HOWEVER. The oops says it was triggered by iptables. Both machines
also use sshguard, which uses iptables to add DROP rules to a chain.
(Nearly all our machines use sshguard, but the two mail gateways are
the only two that give it more than occasional activity, and this one
in particular gives it a decent workout.)

For what it's worth, here is our connlimit setup anyway:

------------------------------------------------------------
The server has two rules that use connlimit:

iptables -A <chain> -j REJECT -p tcp -s 0.0.0.0/0 -m connlimit --connlimit-above 3 --connlimit-mask 24
iptables -A <chain> -j REJECT -p tcp -s 0.0.0.0/0 -m connlimit --connlimit-above 2 --connlimit-mask 32

The other server has a few more such rules, but much less traffic that
is likely to run afoul of them -- that would explain why it took much
longer to crash.
------------------------------------------------------------
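
Putting the two observations together (connlimit rules are loaded, and
the oops fired from an iptables call that tore down a connlimit match),
a possible trigger is to keep connlimit rules in place and then churn
unrelated rules the way sshguard does. The following is an UNTESTED
sketch of that hypothesis, not a confirmed reproducer. The chain name
and the 192.0.2.x addresses are made up for illustration, and real TCP
traffic through the connlimit rules is probably needed as well. It
only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# UNTESTED sketch of a possible trigger. Needs root and an affected
# kernel when DRY_RUN=0. CHAIN and the 192.0.2.x addresses are
# hypothetical placeholders, not taken from the real firewall config.
CHAIN=${CHAIN:-conntest}
DRY_RUN=${DRY_RUN:-1}      # 1: print commands only, 0: execute them

# Print or execute a command depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Install the two connlimit rules quoted above in a scratch chain
# and send incoming TCP traffic through it.
setup() {
    run iptables -N "$CHAIN"
    run iptables -A "$CHAIN" -j REJECT -p tcp -s 0.0.0.0/0 -m connlimit --connlimit-above 3 --connlimit-mask 24
    run iptables -A "$CHAIN" -j REJECT -p tcp -s 0.0.0.0/0 -m connlimit --connlimit-above 2 --connlimit-mask 32
    run iptables -I INPUT -p tcp -j "$CHAIN"
}

# Add and delete DROP rules repeatedly, roughly what sshguard does.
# Usage: churn [count]   (count defaults to 50, keep it below 255)
churn() {
    i=1
    while [ "$i" -le "${1:-50}" ]; do
        run iptables -A "$CHAIN" -s "192.0.2.$i" -j DROP
        run iptables -D "$CHAIN" -s "192.0.2.$i" -j DROP
        i=$((i + 1))
    done
}
```

On a scratch machine, set DRY_RUN=0, call setup, generate some inbound
TCP connections, then call churn. Whether this actually oopses a
4.19.x kernel has not been verified.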

[8.] Environment

[8.1.] Software (add the output of the ver_linux script here)

GNU C               	8.2.0
GNU Make            	4.2.1
Binutils            	2.31.1
Util-linux          	2.31.1
Mount               	2.31.1
Module-init-tools   	25
Linux C Library     	2.28
Dynamic linker (ldd)	2.28
Sh-utils            	8.30

[8.2.] Processor information (from /proc/cpuinfo):

2-core VM, here's one:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           X5570  @ 2.93GHz
stepping	: 5
microcode	: 0x19
cpu MHz		: 2926.000
cache size	: 8192 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fdiv_bug	: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss ht nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc cpuid aperfmperf pni ssse3 cx16 sse4_1 sse4_2 popcnt hypervisor lahf_lm dtherm ida
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 5852.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

[8.3.] Module information (from /proc/modules):

[8.4.] Loaded driver and hardware information (/proc/ioports, /proc/iomem)

[8.5.] PCI information ('lspci -vvv' as root)

[8.6.] SCSI information (from /proc/scsi/scsi)

[8.7.] Other information that might be relevant to the problem
       (please look in /proc and include all information that you
       think to be relevant):

Since this machine will crash so reliably (usually within 2-5 minutes
of boot on an affected kernel) and since it's not user-visible, I can
test easily.

Todd
-- 
Todd Eigenschink                Ferguson Advertising
todd@xxxxxxxx                   http://www.fai2.com/
Non ex transverso sed deorsum   260-407-1584