This patchset changes the conntrack locking and provides a huge
performance improvement.

This patchset is based upon Eric Dumazet's proposed patch:
  http://thread.gmane.org/gmane.linux.network/268758/focus=47306
I have, in agreement with Eric Dumazet, taken over this patch (and
turned it into an entire patchset).

The primary focus is to remove the central spinlock nf_conntrack_lock.
This requires several steps to be achieved.

Patch01: Trivial cleanups.

Patch02: Moves the "special" dying/unconfirmed/template lists to use a
 per-CPU spinlock.

Patch03: Prepares for patch04 by addressing a race condition. This is
 done as a separate patch for the reviewers' sake.

Patch04: Separates expect locking from nf_conntrack_lock. The expect
 list is small (default max 256), thus it just gets a single lock.

Patch05: Finally removes nf_conntrack_lock, and instead uses an array
 of hashed spinlocks to protect insertions/deletions of conntracks
 into the hash table, while still allowing dynamic resizing of the
 hash table.

Testing
-------

For expectations I've mostly tested the FTP nf_conntrack_ftp helper
module, with the commands:

 for x in `seq 1 300`; do \
   echo $x; \
   echo -e "USER anonymous\nPASS nothing\nPASV" | nc 192.168.42.129 21; \
 done

 wget ftp://192.168.42.129/pub/delete.me.4k -O /dev/null

For overload/DoS testing, I've primarily done SYN-flood attack testing.

Results on a 24-core E5-2695v2(ES) with 10Gbit/s ixgbe (with tool trafgen):

 Base kernel : New   810.405 conntrack/sec
 Fixed kernel: New 2.233.876 conntrack/sec

Notice that other flood attacks (SYN+ACK or ACK) can easily be
deflected using:

 # iptables -A INPUT -m state --state INVALID -j DROP
 # sysctl -w net/netfilter/nf_conntrack_tcp_loose=0

E.g. this machine can deflect 6.481.463 "invalid" conntrack/sec (from
an ACK-flood).

Perf data:
----------

The nf_conntrack_lock suffers from huge contention on current
generation servers (8 or more cores/threads).
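
As a rough illustration of the hashed-lock idea described for Patch05,
here is a minimal userspace sketch in C. It is not the kernel code: it
uses pthread mutexes in place of spinlocks, and the names CT_LOCKS,
ct_lock_index(), ct_double_lock() and ct_double_unlock(), as well as
the lock count, are hypothetical. It assumes a conntrack entry is
linked into two hash buckets (one per direction), so both bucket locks
must be held, and they are always taken in ascending index order to
avoid an ABBA deadlock.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical number of locks; each lock guards many hash buckets. */
#define CT_LOCKS 1024

static pthread_mutex_t ct_locks[CT_LOCKS];

/* Initialise every lock in the array once, before any other call. */
static void ct_locks_init(void)
{
    for (unsigned int i = 0; i < CT_LOCKS; i++)
        pthread_mutex_init(&ct_locks[i], NULL);
}

/* Map a bucket hash to the index of the lock that guards it. */
static unsigned int ct_lock_index(uint32_t hash)
{
    return hash % CT_LOCKS;
}

/*
 * Lock the two buckets an entry lives in. The lower lock index is
 * always taken first, so two threads locking the same pair in
 * opposite argument order cannot deadlock. If both hashes map to the
 * same lock, it is taken only once.
 */
static void ct_double_lock(uint32_t h1, uint32_t h2)
{
    unsigned int a = ct_lock_index(h1);
    unsigned int b = ct_lock_index(h2);

    if (a > b) {
        unsigned int tmp = a;
        a = b;
        b = tmp;
    }
    pthread_mutex_lock(&ct_locks[a]);
    if (a != b)
        pthread_mutex_lock(&ct_locks[b]);
}

/* Release the lock(s) taken by ct_double_lock() for the same hashes. */
static void ct_double_unlock(uint32_t h1, uint32_t h2)
{
    unsigned int a = ct_lock_index(h1);
    unsigned int b = ct_lock_index(h2);

    pthread_mutex_unlock(&ct_locks[a]);
    if (a != b)
        pthread_mutex_unlock(&ct_locks[b]);
}
```

With this layout, two CPUs only contend when their buckets happen to
map to the same lock, instead of every insertion and deletion
serialising on one global lock.
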

Data from under SYN-flooding (without a listen socket)

Perf locking congestion is very "visible" on a base kernel:

-  72.56%  ksoftirqd/6  [kernel.kallsyms]  [k] _raw_spin_lock_bh
   - _raw_spin_lock_bh
      + 25.33% init_conntrack
      + 24.86% nf_ct_delete_from_lists
      + 24.62% __nf_conntrack_confirm
      + 24.38% destroy_conntrack
      + 0.70% tcp_packet
+   2.21%  ksoftirqd/6  [kernel.kallsyms]  [k] fib_table_lookup
+   1.15%  ksoftirqd/6  [kernel.kallsyms]  [k] __slab_free
+   0.77%  ksoftirqd/6  [kernel.kallsyms]  [k] inet_getpeer
+   0.70%  ksoftirqd/6  [nf_conntrack]     [k] nf_ct_delete
+   0.55%  ksoftirqd/6  [ip_tables]        [k] ipt_do_table

Perf after the patchset (SYN-flood attack):

+   9.62%  ksoftirqd/6  [kernel.kallsyms]  [k] fib_table_lookup
+   3.78%  ksoftirqd/6  [kernel.kallsyms]  [k] __slab_free
+   2.71%  ksoftirqd/6  [kernel.kallsyms]  [k] inet_getpeer
+   2.55%  ksoftirqd/6  [kernel.kallsyms]  [k] check_leaf
+   2.38%  ksoftirqd/6  [ip_tables]        [k] ipt_do_table
+   2.06%  ksoftirqd/6  [kernel.kallsyms]  [k] __slab_alloc
+   1.94%  ksoftirqd/6  [nf_conntrack]     [k] __nf_conntrack_alloc
-   1.94%  ksoftirqd/6  [kernel.kallsyms]  [k] _raw_spin_lock
   - _raw_spin_lock
      + 90.32% nf_conntrack_double_lock
      + 3.61% get_partial_node
      + 1.81% nf_ct_delete_from_lists
      + 1.68% __nf_conntrack_confirm
      + 1.03% sch_direct_xmit
      + 0.52% scheduler_tick
+   1.86%  ksoftirqd/6  [kernel.kallsyms]  [k] nf_iterate
+   1.80%  ksoftirqd/6  [nf_conntrack]     [k] init_conntrack
+   1.77%  ksoftirqd/6  [kernel.kallsyms]  [k] __neigh_event_send
-   1.70%  ksoftirqd/6  [kernel.kallsyms]  [k] _raw_spin_lock_bh
   - _raw_spin_lock_bh
      + 32.55% nf_ct_del_from_dying_or_unconfirmed_list
      + 25.33% init_conntrack
      + 19.88% tcp_packet
      + 17.97% nf_ct_delete_from_lists
      + 1.62% nf_conntrack_in
      + 1.33% ixgbe_poll
      + 0.74% destroy_conntrack
+   1.64%  ksoftirqd/6  [nf_conntrack]     [k] hash_conntrack_raw
+   1.58%  ksoftirqd/6  [kernel.kallsyms]  [k] __netif_receive_skb_core
+   1.51%  ksoftirqd/6  [nf_conntrack]     [k] __nf_conntrack_find_get
+   1.48%  ksoftirqd/6  [kernel.kallsyms]  [k] __cmpxchg_double_slab
+   1.46%  ksoftirqd/6  [nf_conntrack]     [k] nf_conntrack_in
+   1.45%  ksoftirqd/6  [kernel.kallsyms]  [k] __local_bh_enable_ip

---

Jesper Dangaard Brouer (5):
      netfilter: conntrack: remove central spinlock nf_conntrack_lock
      netfilter: conntrack: seperate expect locking from nf_conntrack_lock
      netfilter: avoid race with exp->master ct
      netfilter: conntrack: spinlock per cpu to protect special lists.
      netfilter: trivial code cleanup and doc changes

 include/net/netfilter/nf_conntrack.h      |  11 +
 include/net/netfilter/nf_conntrack_core.h |   9 +
 include/net/netns/conntrack.h             |  13 +
 net/netfilter/nf_conntrack_core.c         | 432 ++++++++++++++++++++---------
 net/netfilter/nf_conntrack_expect.c       |  36 ++
 net/netfilter/nf_conntrack_h323_main.c    |   4
 net/netfilter/nf_conntrack_helper.c       |  41 ++-
 net/netfilter/nf_conntrack_netlink.c      | 128 +++++----
 net/netfilter/nf_conntrack_sip.c          |   8 -
 9 files changed, 461 insertions(+), 221 deletions(-)