EPILOGUE-AS-PREAMBLE: I had already typed most of this when I thought
to search the netfilter-devel archive. I found this, which sounds an
awful lot like my issue:

https://www.spinics.net/lists/netfilter-devel/msg56882.html

However, the patch link in the first followup seems empty, so I can't
verify that it's the same thing or that the proposed fix works for me.

----------------------------------------------------------------------

[1.] One line summary of the problem:

4.19.x kernels oops in nf_conncount_destroy.

[2.] Full description of the problem/report:

We have been running 4.18.x kernels, up through 4.18.20, in production
for a small web/email hosting operation with no issues. Everything
relevant here is 32-bit Linux on VMware ESXi.

Upon the release of 4.18.20 and knowing that it was EOL, I stepped to
then-current 4.19.4. One of our machines (a mail gateway) hung with an
oops within a minute or two of boot. I rolled it back to deal with
later.

The next morning, another machine (coincidentally another mail gateway)
crashed as well, and the tail end of the oops--that I could see on the
80x25 console--looked similar to what I remembered from the first. I
rolled it back. If a third one happened, I was going to roll them all
back. No other machines had issues.

When 4.19.5 was released, I tried that, with the same effect, so I
decided that since the fastest-crashing machine was, while production,
not going to cause user-visible issues, I'd bisect to try to hunt down
the cause. Every other machine, about 30 total, has been fine on
4.19.4 / 4.19.5.

Bisecting led me to this.

5c789e131cbb997a528451564ea4613e812fc718 is the first bad commit
commit 5c789e131cbb997a528451564ea4613e812fc718
Author: Yi-Hung Wei <yihung.wei@xxxxxxxxx>
Date:   Mon Jul 2 17:33:44 2018 -0700

    netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search

    This patch is originally from Florian Westphal.

    This patch does the following 3 main tasks.

    1) Add list lock to 'struct nf_conncount_list' so that we can
    alter the lists containing the individual connections without
    holding the main tree lock. It would be useful when we only need
    to add/remove to/from a list without allocate/remove a node in the
    tree. With this change, we update nft_connlimit accordingly since
    we longer need to maintain a list lock in nft_connlimit now.

    2) Use RCU for the initial tree search to improve tree look up
    performance.

    3) Add a garbage collection worker. This worker is schedule when
    there are excessive tree node that needed to be recycled.

    Moreover,the rbnode reclaim logic is moved from search tree to
    insert tree to avoid race condition.

    Signed-off-by: Yi-Hung Wei <yihung.wei@xxxxxxxxx>
    Signed-off-by: Florian Westphal <fw@xxxxxxxxx>
    Signed-off-by: Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx>

:040000 040000 3117a9e5f5d91c55bfcb495ed0cf20aac47beb4c eb16c3c84edfa70268c651490dd5031a6474ca2d M include
:040000 040000 f69622ea9603500bc837f6348bc7ffe6e4edefda 8983dc24192abb1ae1925f023a495c39d171021c M net

And it makes perfect sense: Our only two machines that use nf_connlimit
in their firewall configs are those two mail gateways. I imagine that
the speed at which they oops has to do with their specific connlimit
settings and how quickly they accumulate enough traffic to hit one of
them.

Oops details are below.

[3.] Keywords (i.e., modules, networking, kernel):

netfilter, nf_conncount, nf_connlimit

[4.] Kernel information
[4.1.] Kernel version (from /proc/version):
[4.2.] Kernel .config file:

grep = .config, net-related stuff only:

CONFIG_NET=y
CONFIG_NET_INGRESS=y
CONFIG_PACKET=y
CONFIG_UNIX=y
CONFIG_XFRM=y
CONFIG_XFRM_ALGO=y
CONFIG_XFRM_USER=y
CONFIG_XFRM_SUB_POLICY=y
CONFIG_XFRM_IPCOMP=m
CONFIG_NET_KEY=m
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_INET_AH=m
CONFIG_INET_ESP=m
CONFIG_INET_IPCOMP=m
CONFIG_INET_XFRM_TUNNEL=m
CONFIG_INET_TUNNEL=m
CONFIG_INET_XFRM_MODE_TRANSPORT=m
CONFIG_INET_XFRM_MODE_TUNNEL=m
CONFIG_INET_XFRM_MODE_BEET=m
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_NET_PTP_CLASSIFY=y
CONFIG_NETFILTER=y
CONFIG_NETFILTER_ADVANCED=y
CONFIG_NETFILTER_INGRESS=y
CONFIG_NETFILTER_NETLINK=y
CONFIG_NETFILTER_FAMILY_ARP=y
CONFIG_NF_CONNTRACK=y
CONFIG_NF_LOG_COMMON=y
CONFIG_NETFILTER_CONNCOUNT=y
CONFIG_NF_CONNTRACK_MARK=y
CONFIG_NF_CONNTRACK_PROCFS=y
CONFIG_NF_CONNTRACK_TIMEOUT=y
CONFIG_NF_CONNTRACK_FTP=y
CONFIG_NF_CT_NETLINK=y
CONFIG_NF_CT_NETLINK_TIMEOUT=y
CONFIG_NF_NAT=y
CONFIG_NF_NAT_NEEDED=y
CONFIG_NF_NAT_FTP=y
CONFIG_NF_NAT_REDIRECT=y
CONFIG_NF_TABLES=y
CONFIG_NFT_CT=y
CONFIG_NFT_CONNLIMIT=y
CONFIG_NFT_LOG=y
CONFIG_NFT_LIMIT=y
CONFIG_NFT_MASQ=y
CONFIG_NFT_NAT=y
CONFIG_NFT_REJECT=y
CONFIG_NF_FLOW_TABLE=m
CONFIG_NETFILTER_XTABLES=y
CONFIG_NETFILTER_XT_MARK=y
CONFIG_NETFILTER_XT_CONNMARK=y
CONFIG_NETFILTER_XT_TARGET_CONNMARK=y
CONFIG_NETFILTER_XT_TARGET_LOG=y
CONFIG_NETFILTER_XT_TARGET_MARK=y
CONFIG_NETFILTER_XT_NAT=y
CONFIG_NETFILTER_XT_TARGET_NETMAP=y
CONFIG_NETFILTER_XT_TARGET_REDIRECT=y
CONFIG_NETFILTER_XT_TARGET_TPROXY=m
CONFIG_NETFILTER_XT_MATCH_COMMENT=y
CONFIG_NETFILTER_XT_MATCH_CONNLIMIT=y
CONFIG_NETFILTER_XT_MATCH_CONNMARK=y
CONFIG_NETFILTER_XT_MATCH_CONNTRACK=y
CONFIG_NETFILTER_XT_MATCH_ESP=m
CONFIG_NETFILTER_XT_MATCH_HASHLIMIT=m
CONFIG_NETFILTER_XT_MATCH_HELPER=y
CONFIG_NETFILTER_XT_MATCH_IPRANGE=m
CONFIG_NETFILTER_XT_MATCH_LENGTH=y
CONFIG_NETFILTER_XT_MATCH_LIMIT=y
CONFIG_NETFILTER_XT_MATCH_MARK=y
CONFIG_NETFILTER_XT_MATCH_MULTIPORT=m
CONFIG_NETFILTER_XT_MATCH_POLICY=y
CONFIG_NETFILTER_XT_MATCH_STATE=y
CONFIG_NETFILTER_XT_MATCH_STATISTIC=m
CONFIG_NETFILTER_XT_MATCH_STRING=m
CONFIG_NETFILTER_XT_MATCH_TCPMSS=m
CONFIG_NF_DEFRAG_IPV4=y
CONFIG_NF_CONNTRACK_IPV4=y
CONFIG_NF_TPROXY_IPV4=m
CONFIG_NF_TABLES_IPV4=y
CONFIG_NFT_CHAIN_ROUTE_IPV4=y
CONFIG_NFT_REJECT_IPV4=y
CONFIG_NF_TABLES_ARP=y
CONFIG_NF_FLOW_TABLE_IPV4=m
CONFIG_NF_LOG_IPV4=y
CONFIG_NF_REJECT_IPV4=y
CONFIG_NF_NAT_IPV4=y
CONFIG_NFT_CHAIN_NAT_IPV4=y
CONFIG_NF_NAT_MASQUERADE_IPV4=y
CONFIG_NFT_MASQ_IPV4=y
CONFIG_IP_NF_IPTABLES=y
CONFIG_IP_NF_FILTER=y
CONFIG_IP_NF_TARGET_REJECT=y
CONFIG_IP_NF_NAT=y
CONFIG_IP_NF_TARGET_MASQUERADE=y
CONFIG_IP_NF_TARGET_NETMAP=y
CONFIG_IP_NF_TARGET_REDIRECT=y
CONFIG_IP_NF_MANGLE=y

[5.] Most recent kernel version which did not have the bug:

4.18.x is fine. 4.19+ all have it.

[6.] Output of Oops.. message (if applicable) with symbolic information
     resolved (see Documentation/admin-guide/bug-hunting.rst)

For most oopses, all I have is the tail 80x25 of the output since I
can't scroll the console back. A lot of them had call traces that
included bits like:

EIP: native_safe_halt+0x5/0x7
[...]
? siphash_3u64+[...]
default_idle+[...]
arch_cpu_idle+[...]
[...]

as well as some IRQ stuff, which really made no sense to me.

Then, one or two bisect steps from the end, I had one that didn't lock
up the machine, so I could scroll back:

BUG: unable to handle kernel NULL pointer dereference at 00000000
*pdpt = 00000000712fd001 *pde = 0000000000000000
Oops: 0000 [#1] SMP
CPU: 1 PID 26422 Comm: iptables Not tainted 4.18.0-rc3-00851-ged07d9a021df #22
Hardware name: VMware, Inc. VMware Virtual Platform/400BX Desktop Reference Platform, BIOS 6.00 09/30/2014
EIP: nf_conncount_destroy+0x4d/0xa5
Code: ed 4c ff ff 89 f8 83 c7 04 05 04 04 00 00 89 45 ec 89 f8 [...]
[...]
Call Trace:
 connlimit_mt_destroy+0x14/0x16
 cleanup_match+0x34/0x52
 cleanup_entry+0x2e/0x8b
 do_ipt_set_ctl+0x412/0x48e
 ? do_ipt_get_ctl+0x39e/0x39e
 nf_setsockopt+0x37/0x57
 ip_setsockopt+0x4b/0x5a
 [and so on back to entry_SYSENTER_32]

Complete screen shots are available if they'll be of any use.

[7.] A small shell script or example program which triggers the
     problem (if possible)

When I saw "conncount", and knowing that it was our two mail gateways,
my thoughts (above) jumped to our connlimit settings.

HOWEVER. The oops says it was triggered by iptables. Both machines
also use sshguard, which uses iptables to add DROP rules to a chain.
(Nearly all our machines use sshguard, but the two mail gateways are
the only two that give it more than occasional activity, and this one
in particular gives it a decent workout.)

For what it's worth, here is our connlimit setup anyway:

------------------------------------------------------------
The server has two rules that use connlimit:

iptables -A <chain> -j REJECT -p tcp -s 0.0.0.0/0 -m connlimit --connlimit-above 3 --connlimit-mask 24
iptables -A <chain> -j REJECT -p tcp -s 0.0.0.0/0 -m connlimit --connlimit-above 2 --connlimit-mask 32

The other server has a few more such rules, but much less traffic that
is likely to run afoul of them -- that would explain why it took much
longer to crash.
------------------------------------------------------------

[8.] Environment
[8.1.] Software (add the output of the ver_linux script here)

GNU C 8.2.0
GNU Make 4.2.1
Binutils 2.31.1
Util-linux 2.31.1
Mount 2.31.1
Module-init-tools 25
Linux C Library 2.28
Dynamic linker (ldd) 2.28
Sh-utils 8.30

[8.2.] Processor information (from /proc/cpuinfo):

2-core VM, here's one:

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
stepping : 5
microcode : 0x19
cpu MHz : 2926.000
cache size : 8192 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fdiv_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss ht nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc cpuid aperfmperf pni ssse3 cx16 sse4_1 sse4_2 popcnt hypervisor lahf_lm dtherm ida
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
bogomips : 5852.00
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

[8.3.] Module information (from /proc/modules):
[8.4.] Loaded driver and hardware information (/proc/ioports, /proc/iomem)
[8.5.] PCI information ('lspci -vvv' as root)
[8.6.] SCSI information (from /proc/scsi/scsi)
[8.7.] Other information that might be relevant to the problem
       (please look in /proc and include all information that you
       think to be relevant):

Since this machine will crash so reliably (usually within 2-5 minutes
of boot on an affected kernel) and since it's not user-visible, I can
test easily.

Todd

--
Todd Eigenschink
Ferguson Advertising
todd@xxxxxxxx
http://www.fai2.com/
Non ex transverso sed deorsum
260-407-1584