Re: 4.9 conntrack performance issues

Denys Fedoryshchenko <nuclearcat@xxxxxxxxxxxxxx> · Sun, 15 Jan 2017 02:18:45 +0200

On 2017-01-15 01:53, Florian Westphal wrote:
Denys Fedoryshchenko <nuclearcat@xxxxxxxxxxxxxx> wrote:

[ CC Nicolas since he also played with gc heuristics in the past ]

Sorry if I added someone to CC wrongly; please let me know if I should
remove them.
I successfully ran 4.9 on my NAT several days ago, and the panic issue
seems to have disappeared. But I started to face another issue: it seems
the garbage collector is hogging one of the CPUs.

It was handling the load very well on 4.8 and below. It might still be
fine, but I suspect queues that belong to the hogged CPU might experience
issues.

The worker doesn't grab locks for long and calls scheduler for every
bucket to give a chance for other threads to run.

It also doesn't block softinterrupts.

Is there anything that can be done to improve CPU load distribution or
reduce single-core load?

No, I am afraid we don't export any of the heuristics as tuneables so
far.

You could try changing defaults in net/netfilter/nf_conntrack_core.c:

#define GC_MAX_BUCKETS_DIV      64u
/* upper bound of scan intervals */
#define GC_INTERVAL_MAX         (2 * HZ)
/* maximum conntracks to evict per gc run */
#define GC_MAX_EVICTS           256u

(the first two result in ~2 minute worst case timeout detection
 on a fully idle system).

For instance you could use

GC_MAX_BUCKETS_DIV -> 128
GC_INTERVAL_MAX    -> 30 * HZ

(This means that it takes one hour for a dead connection to be picked
 up on an idle system, but that's only relevant in case you use
 conntrack events to log when a connection went down and need more
 precise accounting).
Not a big deal in my case.

I suspect you might also have to change

1011         } else if (expired_count) {
1012                 gc_work->next_gc_run /= 2U;
1013                 next_run = msecs_to_jiffies(1);
1014         } else {

line 1013 to
	next_run = msecs_to_jiffies(HZ / 2);

or something like this to not have frequent rescans.
OK

The gc is also done from the packet path (i.e. accounted
towards (k)softirq).

How many total connections is the machine handling on average?
And how many new/delete events happen per second?
1-2 million connections; at the moment, 988k.
I don't know if this is the correct method to measure the event rate:

NAT ~ # timeout -t 5 conntrack -E -e NEW | wc -l
conntrack v1.4.2 (conntrack-tools): 40027 flow events have been shown.
40027
NAT ~ # timeout -t 5 conntrack -E -e DESTROY | wc -l
conntrack v1.4.2 (conntrack-tools): 40951 flow events have been shown.
40951

It is not peak time, so values can be 2-3 times higher at peak time, but
even right now it is hogging one core, leaving it only 20% idle, while
the others are 80-83% idle.

    88.98%     0.00%  kworker/24:1  [kernel.kallsyms]       [k] process_one_work
            |
            ---process_one_work
               |
               |--54.65%--gc_worker
               |          |
               |           --3.58%--nf_ct_gc_expired
               |                     |
               |                     |--1.90%--nf_ct_delete

I'd be interested to see how often that shows up on other cores
(from packet path).
The other CPUs look totally different.
This is the top entry:
    99.60%     0.00%  swapper  [kernel.kallsyms]    [k] start_secondary
            |
            ---start_secondary
               |
                --99.42%--cpu_startup_entry
                          |
                           --98.04%--default_idle_call
                                     arch_cpu_idle
                                     |
                                     |--48.58%--call_function_single_interrupt
                                     |          |
                                     |           --46.36%--smp_call_function_single_interrupt
                                     |                     smp_trace_call_function_single_interrupt
                                     |                     |
                                     |                     |--44.18%--irq_exit
                                     |                     |          |
                                     |                     |          |--43.37%--__do_softirq
                                     |                     |          |          |
                                     |                     |          |           --43.18%--net_rx_action
                                     |                     |          |                     |
                                     |                     |          |                     |--36.02%--process_backlog
                                     |                     |          |                     |          |
                                     |                     |          |                     |           --35.64%--__netif_receive_skb

gc_worker didn't appear on the other cores at all.
Or am I checking something wrong?

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html