issue with conntrackd wrt handling dead connections

Dear netfilter team,

A while ago we found a correlation between a "locked" CPU and conntrack. We luckily received immediate support from you - awesome and many thanks! As there is no fix for the bug yet, allow me to summarize our findings on this mailing list for anyone else who runs into it:

What is it all about:
After a kernel update we observed occasional interruptions in the network communication and finally found a correlation between those hiccups and conntrack.

Our setup:
We used to run kernel 4.19.19 and updated to 4.19.52. The conntrack-tools are version 1.4.5.

How was the issue visible?
We are using different Intel NICs, one of which is driven by igb. That igb driver reported "Detected Tx Unit Hang" - exactly at the time when conntrack/the kernel was doing some work - and it finally ended in adapter resets:

------------------------------------- snip
[585429.004347] igb 0000:82:00.2: Detected Tx Unit Hang
                  Tx Queue             <0>
                  TDH                  <3b7>
                  TDT                  <3b7>
                  next_to_use          <3b7>
                  next_to_clean        <321>
                buffer_info[next_to_clean]
                  time_stamp           <1037cd094>
                  next_to_watch        <00000000708406c9>
                  jiffies              <1037cd1d4>
                  desc.status          <18c001>

...

kernel: WARNING: CPU: 17 PID: 99 at net/sched/sch_generic.c:461 dev_watchdog+0x218/0x230
kernel: Modules linked in: (...)
kernel: CPU: 17 PID: 99 Comm: ksoftirqd/17 Kdump: loaded Tainted: G O 4.19.52_1_osix_1 #1
kernel: Hardware name: Nexcom na7135, BIOS 5.6.5 08/22/2018
kernel: RIP: 0010:dev_watchdog+0x218/0x230
...
kernel: igb 0000:82:00.2 peth22: Reset adapter
------------------------------------- snip


We had to do some analysis:

Step 1:
We grepped /proc/interrupts to find the CPU for peth22 queue 0. And what did we see? It is the same CPU -> CPU 17.
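
For completeness, this is roughly how the queue-to-CPU mapping can be checked (a sketch; the IRQ naming depends on the driver, so the grep pattern may need adjusting):
------------------------------------- snip
# find the IRQ(s) of peth22 and watch on which CPU the counters increase
[host:~]# grep peth22 /proc/interrupts

# check the configured affinity of that IRQ (replace <irq> with the number
# from the first column above)
[host:~]# cat /proc/irq/<irq>/smp_affinity_list
------------------------------------- snip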

Step 2:
Watching with mpstat and recording with perf, we finally found a clear candidate for the relation between our Tx hangs/NIC resets and the locked CPU: conntrack.
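
Roughly what that looked like (a sketch of the commands; the exact options may differ from what we actually ran):
------------------------------------- snip
# per-CPU utilisation, 1 second interval - CPU 17 stood out
[host:~]# mpstat -P ALL 1

# record what is running on CPU 17 for 30 seconds, then inspect the profile
[host:~]# perf record -C 17 -g -- sleep 30
[host:~]# perf report
------------------------------------- snip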

Step 3:
By binding conntrackd to a specific CPU we finally found the reason: the number of "active connections" in the conntrack db is at the limit (set in net.netfilter.nf_conntrack_max).

Here an example:
------------------------------------- snip
# igb on 0000:82:00.2 => Tx Queue 0 => CPU 17
# and conntrackd using CPU 17:
[host:~]# ps -o psr 14194
PSR
 17

We then "moved" conntrackd to CPU 25:
[host:~]# taskset -cp 25 14194

What happened? The Tx hang moved to a new queue:

igb 0000:82:00.1 -> Tx Queue 6 => CPU 25
# conntrackd was running there as configured above:
[host:~]# ps -o psr 28127
PSR
 25

Grepping the interrupts again showed that the same CPU is used by peth21 Tx Queue 6, with the same messages in the logs (peth21 is on 0000:82:00.1).
------------------------------------- snip

Finally we could reproduce/verify the issue with the following setup:


       .-~.
  .-~-(    )_
 /           ~ -.
|      net       \
 \              .'
   ~._____ . -~
       |
       |                       +-------------------+
       |  eth1 (10.0.0.123/24) | +-------------------+
       +---------------------+ | |  reproA:        | |
                               | |  HA setup with  | + --- mgmt-access (eth0)
                               | |  conntrack on.. | |
                               | |                 | |
                               +-------------------+ |
                                 +-------------------+
                                      192.168.4.x/24 (eth2-eth7)
                                      |
                                      |
                                      |
                                     .+~.
+------------+                  .+~+(    )_
|            |                 /           ~ +.
|  client    | -------------- +      net       \
|            | 192.168.4.2/24  \              .'
+------------+                  ~.______ . +~


------------------------------------- snip
# check the status on reproA
[reproA:~]# conntrackd.init status
...

# run on a client nmap:
root@client:~# nmap -sS 192.168.0.0/24


# recheck and observe the status on reproA:
[reproA:~]# conntrackd.init status

# compare "conntrackd active connections" vs "conntrack counter"
[reproA:~]# A=`conntrackd -i | wc -l`
[reproA:~]# B=`conntrack -C`

# my understanding: it should always hold that A <= B
[reproA:~]# [ $A -le $B ] && printf "All good" || printf "conntrack bug"
conntrack bug
------------------------------------- snip
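
If anyone wants to watch how far the two counters drift apart over time, a simple loop like this should do (just a sketch; the 10 second interval is arbitrary):
------------------------------------- snip
[reproA:~]# while true; do
>   A=`conntrackd -i | wc -l`    # entries in the conntrackd internal cache
>   B=`conntrack -C`             # connections currently tracked by the kernel
>   echo "`date +%T` conntrackd: $A  kernel: $B"
>   sleep 10
> done
------------------------------------- snip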

We could observe that the "current active connections" counter was climbing higher and higher... It does decrease, but not the way it should (see also the next snippet below).

To wrap it up, let me add some additional information on conntrackd and the kernel versions:

With 4.19.19 we were running fine with conntrackd. The "current active connections" count was always roughly equal to the number of flow entries that "conntrack -L" listed. But since we moved to 4.19.52 we could observe issues with the internal cache, where "current active connections" sits at its maximum of 524288 (nf_conntrack_max).

As an example, check the conntrackd output below: you can see it in the "failed" field, whose count was growing high. This example was taken while "conntrack -L" reported roughly 40000 entries, BUT the value of "current active connections" is at 524288, which is _our_ value for net.netfilter.nf_conntrack_max:

------------------------------------- snip
[host:~]# conntrackd -s
cache internal:
current active connections:           524288
connections created:                43238863    failed:    182373143
connections updated:               206420332    failed:            0
connections destroyed:              42714575    failed:            0

cache external:
current active connections:             3757
connections created:                   50236    failed:            0
connections updated:                   11423    failed:            0
connections destroyed:                 46479    failed:            0

traffic processed:
        651600869905 Bytes                1814394269 Pckts

multicast traffic (active device=eth7):
         18458724608 Bytes sent             34282468 Bytes recv
           183464366 Pckts sent              1540600 Pckts recv
                   0 Error send                    0 Error recv

message tracking:
                   0 Malformed msgs                 1276 Lost msgs
------------------------------------- snip
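
For anyone who wants to reproduce the comparison quickly, the numbers can be pulled together like this (a sketch; the awk matching relies on the conntrackd -s layout shown above):
------------------------------------- snip
# internal cache size as reported by conntrackd (first match = cache internal)
[host:~]# conntrackd -s | awk '/current active connections/ { print $4; exit }'

# the kernel limit and the actual number of flow entries, for comparison
[host:~]# sysctl -n net.netfilter.nf_conntrack_max
[host:~]# conntrack -L 2>/dev/null | wc -l
------------------------------------- snip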

Looking for related commits, we stumbled upon the following one:

Commit 7b115755fb9d3aff0ddcd18a5c4d83381362acce generates an almost-unique pseudo-id for a given conntrack.

At that point we asked for support and luckily received it immediately. Florian Westphal provided a test patch, which we could try and which confirmed our findings:

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -472,10 +472,9 @@ u32 nf_ct_get_id(const struct nf_conn *ct)
        net_get_random_once(&ct_id_seed, sizeof(ct_id_seed));

        a = (unsigned long)ct;
-       b = (unsigned long)ct->master ^ net_hash_mix(nf_ct_net(ct));
-       c = (unsigned long)ct->ext;
-       d = (unsigned long)siphash(&ct->tuplehash, sizeof(ct->tuplehash),
-                                  &ct_id_seed);
+       b = 0;
+       c = 0;
+       d = 0;

He also confirmed our findings: "something is wrong in conntrackd wrt. handling dead connections".

To make it short: we have to revert the problematic commit for as long as we are on 4.19 LTS - which we still are.
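
In case it helps anyone else: the hash above is the mainline commit id, so on a stable tree the backport has to be located first (a sketch, assuming a linux-4.19.y checkout; stable backports reference the mainline id in their commit message):
------------------------------------- snip
[host:linux-4.19.y]# git log --oneline --grep=7b115755fb9d3aff0ddcd18a5c4d83381362acce
[host:linux-4.19.y]# git revert <commit-id-from-above>
------------------------------------- snip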

FYI: I tested the proposal by Dirk Morris (m=1565297875085610), which "changes ct id hash calculation to only use invariants": that one would help. Is there any chance of getting it upstream? Or what is the plan?

In case we can assist, let us know.

Regards,
  michi schnyder
