issue with conntrackd wrt handling dead connections

Dear netfilter team,

A while ago we found a correlation between a "locked" CPU and conntrack. We luckily received immediate support from you - awesome and many thanks! As there is no fix for the bug yet, allow me to summarize our findings on this mailing list for anyone else who runs into it:

What is it all about:
After a kernel update we observed occasional interruptions in the network communication and finally found a correlation between those hiccups and conntrack.

Our setup:
We used to run kernel 4.19.19 and updated to 4.19.52. The conntrack-tools are version 1.4.5.

How was the issue visible?
We are using different Intel NICs, one of which is driven by igb. That igb driver reported "Detected Tx Unit Hang" - exactly at the time when conntrack/the kernel was doing some work - and it finally ended in adapter resets:

------------------------------------- snip
[585429.004347] igb 0000:82:00.2: Detected Tx Unit Hang
                  Tx Queue             <0>
                  TDH                  <3b7>
                  TDT                  <3b7>
                  next_to_use          <3b7>
                  next_to_clean        <321>
                buffer_info[next_to_clean]
                  time_stamp           <1037cd094>
                  next_to_watch        <00000000708406c9>
                  jiffies              <1037cd1d4>
                  desc.status          <18c001>

...

kernel: WARNING: CPU: 17 PID: 99 at net/sched/sch_generic.c:461 dev_watchdog+0x218/0x230
kernel: Modules linked in: (...)
kernel: CPU: 17 PID: 99 Comm: ksoftirqd/17 Kdump: loaded Tainted: G O 4.19.52_1_osix_1 #1
kernel: Hardware name: Nexcom na7135, BIOS 5.6.5 08/22/2018
kernel: RIP: 0010:dev_watchdog+0x218/0x230
...
kernel: igb 0000:82:00.2 peth22: Reset adapter
------------------------------------- snip


We had to do some analysis:

Step 1:
We grepped /proc/interrupts to find the CPU for peth22 queue 0. And what did we see? It is the same CPU -> CPU 17.
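
For completeness, this is roughly how the queue-to-CPU mapping can be checked (a sketch; the IRQ naming depends on the driver, so the grep pattern may need adjusting):
------------------------------------- snip
# find the IRQ(s) of peth22 and watch on which CPU the counters increase
[host:~]# grep peth22 /proc/interrupts

# check the configured affinity of that IRQ (replace <irq> with the number
# from the first column above)
[host:~]# cat /proc/irq/<irq>/smp_affinity_list
------------------------------------- snip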

Step 2:
Watching with mpstat and recording with perf, we finally found a clear candidate for the relation between our Tx hangs/NIC resets and the locked CPU: conntrack.
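
Roughly what that looked like (a sketch of the commands; the exact options may differ from what we actually ran):
------------------------------------- snip
# per-CPU utilisation, 1 second interval - CPU 17 stood out
[host:~]# mpstat -P ALL 1

# record what is running on CPU 17 for 30 seconds, then inspect the profile
[host:~]# perf record -C 17 -g -- sleep 30
[host:~]# perf report
------------------------------------- snip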

Step 3:
By binding conntrackd to a specific CPU we finally found the reason: the number of "active connections" in the conntrack db is at the limit (set in net.netfilter.nf_conntrack_max).

Here an example:
------------------------------------- snip
# igb on 0000:82:00.2 => Tx Queue 0 => CPU 17
# and conntrackd using CPU 17:
[host:~]# ps -o psr 14194
PSR
 17

We then "moved" conntrackd to CPU 25:
[host:~]# taskset -cp 25 14194

What happened? The Tx hang moved to a new queue:

igb 0000:82:00.1 -> Tx Queue 6 => CPU 25
# conntrackd was running there as configured above:
[host:~]# ps -o psr 28127
PSR
 25

Grepping the interrupts again showed that the same CPU is used by peth21 Tx Queue 6, with the same messages in the logs (peth21 is on 0000:82:00.1).
------------------------------------- snip

Finally we could reproduce/verify the issue with the following setup:


       .-~.
  .-~-(    )_
 /           ~ -.
|      net       \
 \              .'
   ~._____ . -~
       |
       |                       +-------------------+
       |  eth1 (10.0.0.123/24) | +-------------------+
       +---------------------+ | |  reproA:        | |
                               | |  HA setup with  | + --- mgmt-access (eth0)
                               | |  conntrack on.. | |
                               | |                 | |
                               +-------------------+ |
                                 +-------------------+
                                      192.168.4.x/24 (eth2-eth7)
                                      |
                                      |
                                      |
                                     .+~.
+------------+                  .+~+(    )_
|            |                 /           ~ +.
|  client    | -------------- +      net       \
|            | 192.168.4.2/24  \              .'
+------------+                  ~.______ . +~


------------------------------------- snip
# check the status on reproA
[reproA:~]# conntrackd.init status
...

# run on a client nmap:
root@client:~# nmap -sS 192.168.0.0/24


# recheck and observe the status on reproA:
[reproA:~]# conntrackd.init status

# compare "conntrackd active connections" vs "conntrack counter"
[reproA:~]# A=`conntrackd -i | wc -l`
[reproA:~]# B=`conntrack -C`

# my understanding: it should always hold that A <= B
[reproA:~]# [ $A -le $B ] && printf "All good" || printf "conntrack bug"
conntrack bug
------------------------------------- snip
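
If anyone wants to watch how far the two counters drift apart over time, a simple loop like this should do (just a sketch; the 10 second interval is arbitrary):
------------------------------------- snip
[reproA:~]# while true; do
>   A=`conntrackd -i | wc -l`    # entries in the conntrackd internal cache
>   B=`conntrack -C`             # connections currently tracked by the kernel
>   echo "`date +%T` conntrackd: $A  kernel: $B"
>   sleep 10
> done
------------------------------------- snip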

We could observe that the "current active connections" counter was climbing higher and higher... It does decrease, but not the way it should (see also the next snippet below).

To wrap it up, let me add some additional information on conntrackd and the kernel versions:

With 4.19.19 we were running fine with conntrackd. The "current active connections" count was always roughly equal to the number of flow entries that "conntrack -L" listed. But since we moved to 4.19.52 we could observe issues with the internal cache, where "current active connections" sits at its maximum of 524288 (nf_conntrack_max).

As an example, check the conntrackd output below: you can see it in the "failed" field, whose count was growing high. This example was taken while "conntrack -L" reported roughly 40000 entries, BUT the value of "current active connections" is at 524288, which is _our_ value for net.netfilter.nf_conntrack_max:

------------------------------------- snip
[host:~]# conntrackd -s
cache internal:
current active connections:           524288
connections created:                43238863    failed:    182373143
connections updated:               206420332    failed:            0
connections destroyed:              42714575    failed:            0

cache external:
current active connections:             3757
connections created:                   50236    failed:            0
connections updated:                   11423    failed:            0
connections destroyed:                 46479    failed:            0

traffic processed:
        651600869905 Bytes                1814394269 Pckts

multicast traffic (active device=eth7):
         18458724608 Bytes sent             34282468 Bytes recv
           183464366 Pckts sent              1540600 Pckts recv
                   0 Error send                    0 Error recv

message tracking:
                   0 Malformed msgs                 1276 Lost msgs
------------------------------------- snip
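
For anyone who wants to reproduce the comparison quickly, the numbers can be pulled together like this (a sketch; the awk matching relies on the conntrackd -s layout shown above):
------------------------------------- snip
# internal cache size as reported by conntrackd (first match = cache internal)
[host:~]# conntrackd -s | awk '/current active connections/ { print $4; exit }'

# the kernel limit and the actual number of flow entries, for comparison
[host:~]# sysctl -n net.netfilter.nf_conntrack_max
[host:~]# conntrack -L 2>/dev/null | wc -l
------------------------------------- snip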

Looking for related commits, we stumbled upon the following one:

Commit 7b115755fb9d3aff0ddcd18a5c4d83381362acce generates an almost-unique pseudo-id for a given conntrack.

At that point we asked for support and luckily received it immediately. Florian Westphal provided a test patch, which we could try and which confirmed our findings:

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -472,10 +472,9 @@ u32 nf_ct_get_id(const struct nf_conn *ct)
        net_get_random_once(&ct_id_seed, sizeof(ct_id_seed));

        a = (unsigned long)ct;
-       b = (unsigned long)ct->master ^ net_hash_mix(nf_ct_net(ct));
-       c = (unsigned long)ct->ext;
-       d = (unsigned long)siphash(&ct->tuplehash, sizeof(ct->tuplehash),
-                                  &ct_id_seed);
+       b = 0;
+       c = 0;
+       d = 0;

He also confirmed our findings: "something is wrong in conntrackd wrt. handling dead connections".

To make it short: we have to revert the problematic commit for as long as we are on 4.19 LTS - which we still are.
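
In case it helps anyone else: the hash above is the mainline commit id, so on a stable tree the backport has to be located first (a sketch, assuming a linux-4.19.y checkout; stable backports reference the mainline id in their commit message):
------------------------------------- snip
[host:linux-4.19.y]# git log --oneline --grep=7b115755fb9d3aff0ddcd18a5c4d83381362acce
[host:linux-4.19.y]# git revert <commit-id-from-above>
------------------------------------- snip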

FYI: I tested the proposal by Dirk Morris (m=1565297875085610), which "changes ct id hash calculation to only use invariants": that one would help. Is there any chance of getting it upstream? Or what is the plan?

In case we can assist, let us know.

Regards,
  michi schnyder
