nf_conntrack overflow crashes OSDs

kc@xxxxxxxxxx (Christian Kauhaus) · Fri, 08 Aug 2014 10:46:44 +0200

Hi,

today I'd like to share a severe problem we've found (and fixed) on our Ceph
cluster. We're running 48 OSDs (8 per host). While restarting all OSDs on a
host, the kernel's nf_conntrack table overflowed. This rendered all OSDs on
that machine unusable.

The symptoms were as follows. In the kernel log, we saw lines like:

| Aug  6 15:23:48 cartman06 kernel: [12713575.554784] nf_conntrack: table
full, dropping packet

This is effectively a DoS against the kernel's IP stack.

In the OSD log files, we saw repeated connection attempts like:

| 2014-08-06 15:22:35.348175 7f92f25a8700 10 -- 172.22.4.42:6802/9560 >>
172.22.4.51:0/2025662 pipe(0x7f9208035440 sd=382 :6802 s=2 pgs=26750 cs=1 l=1
c=0x7f92080021c0).fault on lossy channel, failing
| 2014-08-06 15:22:35.348287 7f8fd69e4700 10 -- 172.22.4.42:6802/9560 >>
172.22.4.39:0/3024957 pipe(0x7f9208007b30 sd=149 :6802 s=2 pgs=245725 cs=1 l=1
c=0x7f9208036630).fault on lossy channel, failing
| 2014-08-06 15:22:35.348293 7f8fe24e4700 20 -- 172.22.4.42:6802/9560 >>
172.22.4.38:0/1013265 pipe(0x7f92080476e0 sd=450 :6802 s=4 pgs=32439 cs=1 l=1
c=0x7f9208018e90).writer finishing
| 2014-08-06 15:22:35.348284 7f8fd4fca700  2 -- 172.22.4.42:6802/9560 >>
172.22.4.5:0/3032136 pipe(0x7f92080686b0 sd=305 :6802 s=2 pgs=306100 cs=1 l=1
c=0x7f920805f340).fault 0: Success
| 2014-08-06 15:22:35.348292 7f8fd108b700 20 -- 172.22.4.42:6802/9560 >>
172.22.4.4:0/1000901 pipe(0x7f920802e7d0 sd=401 :6802 s=4 pgs=73173 cs=1 l=1
c=0x7f920802eda0).writer finishing
| 2014-08-06 15:22:35.344719 7f8fd1d98700  2 -- 172.22.4.42:6802/9560 >>
172.22.4.49:0/3026524 pipe(0x7f9208033a80 sd=492 :6802 s=2 pgs=12845 cs=1 l=1
c=0x7f9208033ce0).reader couldn't read tag, Success

and so on, generating thousands of log lines. The OSDs were spinning at 100%
CPU, trying to reconnect in rapid succession. The repeated connection
attempts kept the nf_conntrack table full, so it could not recover from the
overflow on its own.

Thus, we saw blocked requests for 15 minutes or so, until the MONs banned the
stuck OSDs from the cluster.

As a short-term countermeasure, we stopped all OSDs on the affected hosts and
started them one by one, leaving enough time in between to allow the recovery
to settle a bit (a 10-second gap between OSDs was enough). During normal
operation, we see only 5000-6000 connections on a host.
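
The staggered start can be scripted. The following is only a minimal sketch,
not the procedure we actually ran: start_staggered and its start_osd callback
are hypothetical names, and the command that starts a single OSD depends on
your distribution and init system, so it is left to the caller. The 10-second
default is the gap mentioned above.

```python
import time
from typing import Callable, Iterable, List


def start_staggered(
    osd_ids: Iterable[int],
    start_osd: Callable[[int], None],
    gap_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Start OSDs one at a time, pausing between consecutive starts.

    start_osd is a caller-supplied function (hypothetical) that starts a
    single OSD by id. The pause happens only between starts, not before
    the first one or after the last one.
    """
    started: List[int] = []
    for index, osd_id in enumerate(osd_ids):
        if index > 0:
            # Give the previous OSD time to reconnect and settle.
            sleep(gap_seconds)
        start_osd(osd_id)
        started.append(osd_id)
    return started
```

The sleep function is a parameter only so the loop can be exercised without
actually waiting. In real use, start_osd would wrap whatever init command
starts one OSD on your hosts.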

As a permanent fix, we have doubled the size of the nf_conntrack table and
reduced some timeouts according to
<http://www.pc-freak.net/blog/resolving-nf_conntrack-table-full-dropping-packet-flood-message-in-dmesg-linux-kernel-log/>.
Now a restart of all 8 OSDs on a host works without problems.
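
To see how close a host is to the limit, the current entry count can be
compared with the table size. The sketch below assumes the usual procfs files
nf_conntrack_count and nf_conntrack_max under /proc/sys/net/netfilter, which
are present only while the nf_conntrack module is loaded. The function names
and the 0.8 warning threshold are illustrative choices, not values from our
setup.

```python
from pathlib import Path
from typing import Tuple

# Usual procfs location when the nf_conntrack module is loaded.
PROC_NETFILTER = Path("/proc/sys/net/netfilter")


def conntrack_usage(base: Path = PROC_NETFILTER) -> Tuple[int, int, float]:
    """Return (tracked entries, table limit, fill ratio) read from base."""
    count = int((base / "nf_conntrack_count").read_text().strip())
    limit = int((base / "nf_conntrack_max").read_text().strip())
    return count, limit, count / limit


def is_near_limit(count: int, limit: int, threshold: float = 0.8) -> bool:
    """True if the table is at least `threshold` full (threshold is arbitrary)."""
    return count >= threshold * limit
```

Calling conntrack_usage() without arguments on an OSD host during a restart
shows how quickly the table fills. The base argument exists so the function
can be pointed at a test directory.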

We also considered removing nf_conntrack completely. This, however, is not
possible since we use host-based firewalling and nf_conntrack is wired quite
deeply into the Linux firewall code.

Just to share our experience in case someone experiences the same problem.

Regards

Christian

-- 
Dipl.-Inf. Christian Kauhaus <>< · kc at gocept.com · systems administration
gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
http://gocept.com · tel +49 345 219401-11
Python, Pyramid, Plone, Zope · consulting, development, hosting, operations