We were running a database benchmark in a mem cgroup and found that lruvec_lru_size was taking up a large chunk of CPU cycles (about 25% of our kernel time) on a 5.3 kernel. The main issue is the loop in lruvec_page_state_local, called by lruvec_lru_size in the mem cgroup path:

	for_each_possible_cpu(cpu)
		x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);

Looping through all the cpus to get the lru vec size info is costly, and doing this on our workload with 96 cpu threads and 500 mem cgroups makes things much worse: we can end up walking 96 cpus * 500 cgroups * 2 (main) LRUs of per-cpu counters, which is a lot of data structures to be running through all the time.

Hongwei's patch (https://lore.kernel.org/linux-mm/991b4719-a2a0-9efe-de02-56a928752fe3@xxxxxxxxxx/) restores the previous method of computing lru_size and is much more efficient at getting it. We got a 20% throughput improvement in our database benchmark with Hongwei's patch, and lruvec_lru_size's cpu overhead completely disappeared from the cpu profile. We'd like to see Hongwei's patch merged.

The problem can also be reproduced by running the simple multi-threaded pmbench benchmark with a fast Optane SSD swap (see profile below).
    6.15%     3.08%  pmbench  [kernel.vmlinux]  [k] lruvec_lru_size
            |
            |--3.07%--lruvec_lru_size
            |          |
            |          |--2.11%--cpumask_next
            |          |          |
            |          |           --1.66%--find_next_bit
            |          |
            |           --0.57%--call_function_interrupt
            |                     |
            |                      --0.55%--smp_call_function_interrupt
            |
            |--1.59%--0x441f0fc3d009
            |          _ops_rdtsc_init_base_freq
            |          access_histogram
            |          page_fault
            |          __do_page_fault
            |          handle_mm_fault
            |          __handle_mm_fault
            |          |
            |           --1.54%--do_swap_page
            |                     swapin_readahead
            |                     swap_cluster_readahead
            |                     |
            |                      --1.53%--read_swap_cache_async
            |                                __read_swap_cache_async
            |                                alloc_pages_vma
            |                                __alloc_pages_nodemask
            |                                __alloc_pages_slowpath
            |                                try_to_free_pages
            |                                do_try_to_free_pages
            |                                shrink_node
            |                                shrink_node_memcg
            |                                |
            |                                |--0.77%--lruvec_lru_size
            |                                |
            |                                 --0.76%--inactive_list_is_low
            |                                           |
            |                                            --0.76%--lruvec_lru_size
            |
             --1.50%--measure_read
                       page_fault
                       __do_page_fault
                       handle_mm_fault
                       __handle_mm_fault
                       do_swap_page
                       swapin_readahead
                       swap_cluster_readahead
                       |
                        --1.48%--read_swap_cache_async
                                  __read_swap_cache_async
                                  alloc_pages_vma
                                  __alloc_pages_nodemask
                                  __alloc_pages_slowpath
                                  try_to_free_pages
                                  do_try_to_free_pages
                                  shrink_node
                                  shrink_node_memcg
                                  |
                                  |--0.75%--inactive_list_is_low
                                  |          |
                                  |           --0.75%--lruvec_lru_size
                                  |
                                   --0.73%--lruvec_lru_size

Thanks.

Tim