Hi,

On Fri, Sep 7, 2018 at 12:21 AM, Eric Dumazet <edumazet@xxxxxxxxxx> wrote:
> On Fri, Sep 7, 2018 at 12:03 AM Eric Dumazet <edumazet@xxxxxxxxxx> wrote:
>
>> Problem is : we have platforms with more than 100 cpus, and
>> sk_memory_allocated() cost will be too expensive,
>> especially if the host is under memory pressure, since all cpus will
>> touch their private counter.
>>
>> per cpu variables do not really scale, they were ok 10 years ago when
>> no more than 16 cpus were the norm.
>>
>> I would prefer change TCP to not aggressively call
>> __sk_mem_reduce_allocated() from tcp_write_timer()
>>
>> Ideally only tcp_retransmit_timer() should attempt to reduce forward
>> allocations, after recurring timeout.
>>
>> Note that after 20c64d5cd5a2bdcdc8982a06cb05e5e1bd851a3d ("net: avoid
>> sk_forward_alloc overflows")
>> we have better control over sockets having huge forward allocations.
>>
>> Something like :
>
> Or something less risky :

I gave both of these patches a run, and neither does as well on the
system that has slower atomics.
:(

The percpu version:

     8.05%  workload  [kernel.vmlinux]  [k] __do_softirq
     7.04%  swapper   [kernel.vmlinux]  [k] cpuidle_enter_state
     5.54%  workload  [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
     1.66%  swapper   [kernel.vmlinux]  [k] __do_softirq
     1.55%  workload  [kernel.vmlinux]  [k] finish_task_switch
     1.24%  swapper   [kernel.vmlinux]  [k] finish_task_switch
     1.07%  workload  [kernel.vmlinux]  [k] net_rx_action

The first patch from you still has a significant amount of time spent
in the atomics paths (non-inlined versions used):

     7.87%  workload  [kernel.vmlinux]  [k] __ll_sc_atomic64_sub
     7.48%  workload  [kernel.vmlinux]  [k] __do_softirq
     5.05%  workload  [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
     2.42%  workload  [kernel.vmlinux]  [k] __ll_sc_atomic64_add_return
     1.49%  swapper   [kernel.vmlinux]  [k] cpuidle_enter_state
     1.31%  workload  [kernel.vmlinux]  [k] finish_task_switch
     1.09%  workload  [kernel.vmlinux]  [k] tcp_sendmsg_locked
     1.08%  workload  [kernel.vmlinux]  [k] __arch_copy_from_user
     1.02%  workload  [kernel.vmlinux]  [k] net_rx_action

I think a lot of the overhead from the percpu approach can be
alleviated if we can use percpu_counter_read() instead of _sum()
(i.e. no need to iterate through the local per-cpu recent delta). I
don't know the TCP stack well enough to tell where it's OK to use a
bit of slack in the numbers though -- by default count will at most be
off by 32*online cpus. Might not be a significant number in reality.

-Olof
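
For illustration of the read-vs-sum tradeoff mentioned above, here is
a minimal userspace model. It is not kernel code and not taken from
either patch: the names model_counter, model_add, model_read and
model_sum, and the NCPUS value, are made up for this sketch. BATCH is
set to 32 only to match the slack figure quoted above. The cheap read
looks at the shared total alone, while the exact sum has to walk every
per-CPU delta.

```c
/* Userspace sketch only: models the batching idea behind a per-CPU
 * counter. All names and constants here are illustrative assumptions. */
#include <assert.h>
#include <stdint.h>

#define NCPUS 4   /* hypothetical CPU count for the model */
#define BATCH 32  /* assumed batch size, matching the "32" slack figure */

struct model_counter {
    int64_t count;         /* shared, approximate total */
    int32_t local[NCPUS];  /* per-CPU deltas not yet folded in */
};

/* Add on one "CPU"; fold the local delta into the shared total once
 * its magnitude reaches the batch size. */
void model_add(struct model_counter *c, int cpu, int32_t amount)
{
    assert(cpu >= 0 && cpu < NCPUS);
    int32_t v = c->local[cpu] + amount;
    if (v >= BATCH || v <= -BATCH) {
        c->count += v;
        c->local[cpu] = 0;
    } else {
        c->local[cpu] = v;
    }
}

/* Cheap read: touches only the shared total, so it may lag the true
 * value by less than BATCH per CPU. */
int64_t model_read(const struct model_counter *c)
{
    return c->count;
}

/* Exact read: walks every per-CPU delta, so cost grows with CPU count. */
int64_t model_sum(const struct model_counter *c)
{
    int64_t total = c->count;
    for (int cpu = 0; cpu < NCPUS; cpu++)
        total += c->local[cpu];
    return total;
}
```

In this model, if each of the four CPUs has added 31, model_read()
still returns 0 while model_sum() returns 124, so the cheap read is
off by less than BATCH * NCPUS. Whether that much slack is acceptable
for the TCP memory accounting is the open question above.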