On Mon 14-10-19 10:17:45, Tim Chen wrote:
> We were running a database benchmark in a mem cgroup and found that lruvec_lru_size
> is taking up a huge chunk of CPU cycles (about 25% of our kernel time) on the 5.3 kernel.
> 
> The main issue is the loop in lruvec_page_state_local(), called by lruvec_lru_size()
> in the mem cgroup path:
> 
> 	for_each_possible_cpu(cpu)
> 		x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);
> 
> It is costly to loop through all the cpus to get the lru vec size info,
> and doing this on our workload with 96 cpu threads and 500 mem cgroups
> makes things much worse: we might end up walking 96 cpus * 500 cgroups * 2 (main) LRUs'
> worth of pagevecs, which is a lot of data structures to be running through all the time.

Why does the number of cgroups matter?

> Hongwei's patch
> (https://lore.kernel.org/linux-mm/991b4719-a2a0-9efe-de02-56a928752fe3@xxxxxxxxxx/)
> restores the previous method of computing lru_size and is much more efficient at getting the lru_size.
> We got a 20% throughput improvement in our database benchmark with Hongwei's patch, and
> lruvec_lru_size's cpu overhead completely disappeared from the cpu profile.
> 
> We'd like to see Hongwei's patch get merged.

The main problem with the patch was a lack of justification. If the
performance improvement is this large (I am quite surprised, TBH) then I
would obviously not have any objections. Care to send the patch with a
complete changelog?

> The problem can also be reproduced by running a simple multi-threaded pmbench benchmark
> with a fast Optane SSD swap (see profile below).
> 
> 
>     6.15%  3.08%  pmbench  [kernel.vmlinux]  [k] lruvec_lru_size
>             |
>             |--3.07%--lruvec_lru_size
>             |          |
>             |          |--2.11%--cpumask_next
>             |          |          |
>             |          |           --1.66%--find_next_bit
>             |          |
>             |           --0.57%--call_function_interrupt
>             |                     |
>             |                      --0.55%--smp_call_function_interrupt
>             |
>             |--1.59%--0x441f0fc3d009
>             |          _ops_rdtsc_init_base_freq
>             |          access_histogram
>             |          page_fault
>             |          __do_page_fault
>             |          handle_mm_fault
>             |          __handle_mm_fault
>             |          |
>             |           --1.54%--do_swap_page
>             |                     swapin_readahead
>             |                     swap_cluster_readahead
>             |                     |
>             |                      --1.53%--read_swap_cache_async
>             |                                __read_swap_cache_async
>             |                                alloc_pages_vma
>             |                                __alloc_pages_nodemask
>             |                                __alloc_pages_slowpath
>             |                                try_to_free_pages
>             |                                do_try_to_free_pages
>             |                                shrink_node
>             |                                shrink_node_memcg
>             |                                |
>             |                                |--0.77%--lruvec_lru_size
>             |                                |
>             |                                 --0.76%--inactive_list_is_low
>             |                                           |
>             |                                            --0.76%--lruvec_lru_size
>             |
>              --1.50%--measure_read
>                        page_fault
>                        __do_page_fault
>                        handle_mm_fault
>                        __handle_mm_fault
>                        do_swap_page
>                        swapin_readahead
>                        swap_cluster_readahead
>                        |
>                         --1.48%--read_swap_cache_async
>                                   __read_swap_cache_async
>                                   alloc_pages_vma
>                                   __alloc_pages_nodemask
>                                   __alloc_pages_slowpath
>                                   try_to_free_pages
>                                   do_try_to_free_pages
>                                   shrink_node
>                                   shrink_node_memcg
>                                   |
>                                   |--0.75%--inactive_list_is_low
>                                   |          |
>                                   |           --0.75%--lruvec_lru_size
>                                   |
>                                    --0.73%--lruvec_lru_size
> 
> 
> Thanks.
> 
> Tim

-- 
Michal Hocko
SUSE Labs
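
For illustration, a minimal user-space sketch of the read-side cost being
discussed. The structures and function names below are stand-ins, not the
actual mm/ internals: the first variant mimics the per-cpu summation done
by lruvec_page_state_local(), the second a directly maintained per-LRU
size that can be read in O(1).

	#include <stdio.h>

	#define NR_CPUS  96	/* per the report: 96 CPU threads */
	#define NR_LRUS   5

	/* Per-cpu counter scheme: every read must sum all possible CPUs. */
	struct pcpu_lruvec_stat {
		long count[NR_LRUS];
	};

	static struct pcpu_lruvec_stat pcpu_stats[NR_CPUS];

	static long lru_size_percpu_sum(int lru)
	{
		long x = 0;

		/* O(NR_CPUS) work on every call */
		for (int cpu = 0; cpu < NR_CPUS; cpu++)
			x += pcpu_stats[cpu].count[lru];
		return x > 0 ? x : 0;
	}

	/* Maintained-counter scheme: updated on add/remove, O(1) to read. */
	static long lru_zone_size[NR_LRUS];

	static long lru_size_direct(int lru)
	{
		return lru_zone_size[lru];
	}

	int main(void)
	{
		/* pretend CPU 3 added 128 pages to LRU 0 */
		pcpu_stats[3].count[0] += 128;
		lru_zone_size[0] += 128;

		printf("per-cpu sum: %ld, direct: %ld\n",
		       lru_size_percpu_sum(0), lru_size_direct(0));
		return 0;
	}

The first read is O(nr_possible_cpus) per LRU per call, which is where the
reported 96 cpus * 500 cgroups cost multiplies up; the second is a single
load, at the price of updating the counter whenever pages move on or off
the list.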