Hi,

On Thu, Sep 6, 2018 at 8:32 PM, Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx> wrote:
> On Thu, Sep 06, 2018 at 12:33:58PM -0700, Eric Dumazet wrote:
>> On Thu, Sep 6, 2018 at 12:21 PM Olof Johansson <olof@xxxxxxxxx> wrote:
>> >
>> > Today these are all global shared variables per protocol, and in
>> > particular tcp_memory_allocated can get hot on a system with a
>> > large number of CPUs and a substantial number of connections.
>> >
>> > Moving it over to a per-cpu variable makes it significantly cheaper,
>> > and the added overhead when summing up the percpu copies is still
>> > smaller than the cost of having a hot cacheline bouncing around.
>>
>> I am curious. We never noticed contention on this variable, at least for TCP.
>
> Yes, these variables are heavily amortised, so I'm surprised that
> they would cause much contention.
>
>> Please share some numbers with us.
>
> Indeed.

Certainly, I just had to collect them again.

This is on a dual Xeon box with ~150-200k TCP connections. I see about
0.7% of CPU spent in __sk_mem_{reduce,raise}_allocated in the inlined
atomic ops, most of that in reduce. The call path for reduce is
practically all from tcp_write_timer on softirq:

  __sk_mem_reduce_allocated
  tcp_write_timer
  call_timer_fn
  run_timer_softirq
  __do_softirq
  irq_exit
  smp_apic_timer_interrupt
  apic_timer_interrupt
  cpuidle_enter_state

With this patch, I see about 0.18 + 0.11 + 0.07 = 0.36% in the
percpu-related functions called from the same __sk_mem functions.
That's a halving of the cycle samples on that specific setup.

The real difference, though, is on another platform where atomics are
more expensive. There, this change makes a significant difference.
Unfortunately, I can't share specifics, but I think this change stands
on its own on the dual Xeon setup as well, maybe with slightly less
strong wording on just how hot the variable/cache line happens to be.


-Olof
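
[Editorial note: to make the trade-off being discussed concrete, here is a
minimal userspace analogue of the per-cpu counter pattern, not the actual
patch. In the kernel the update side would presumably use this_cpu_add()
and the read side a sum over the per-cpu copies; below, pthreads stand in
for CPUs, and the thread count and 64-byte padding are illustrative
assumptions. It contrasts a single shared atomic, whose cache line bounces
between cores, with per-thread counters that are only summed on read.]

/*
 * Illustrative userspace sketch (not the kernel patch): shared atomic
 * counter vs. per-thread counters summed on read.
 * Build with: cc -O2 -pthread percpu_sketch.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 8          /* stands in for the number of CPUs */
#define ITERS    1000000

/* Shared counter: every update bounces the same cache line around. */
static atomic_long shared_allocated;

/* Per-thread counters, padded to avoid false sharing of adjacent slots. */
struct padded_counter {
	long val;
	char pad[64 - sizeof(long)];
};
static struct padded_counter percpu_allocated[NTHREADS];

static void *worker(void *arg)
{
	long id = (long)arg;

	for (int i = 0; i < ITERS; i++) {
		/* Hot path of the shared-variable scheme: global atomic. */
		atomic_fetch_add_explicit(&shared_allocated, 1,
					  memory_order_relaxed);
		/* Hot path of the per-cpu scheme: purely local update. */
		percpu_allocated[id].val++;
	}
	return NULL;
}

/* Read side of the per-cpu scheme: sum all the local copies. */
static long percpu_sum(void)
{
	long sum = 0;

	for (int i = 0; i < NTHREADS; i++)
		sum += percpu_allocated[i].val;
	return sum;
}

int main(void)
{
	pthread_t tid[NTHREADS];

	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);

	printf("shared: %ld  percpu sum: %ld\n",
	       atomic_load(&shared_allocated), percpu_sum());
	return 0;
}

The design trade-off is the one described above: updates become cheap and
contention-free, while reads have to walk every per-cpu slot, which only
pays off if updates are much more frequent than reads.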