memcgroup lruvec_lru_size scaling issue

Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx> · Mon, 14 Oct 2019 10:17:45 -0700

We were running a database benchmark in mem cgroup and found that lruvec_lru_size
is taking up a huge chunk of CPU cycles (about 25% of our kernel time) on 5.3 kernel.

The main issue is the loop in lruvec_page_state_local called by lruvec_lru_size
in the mem cgroup path:

for_each_possible_cpu(cpu)
	x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);

It is costly looping through all the cpus to get the lru vec size info.
And doing this on our workload with 96 cpu threads and 500 mem cgroups
makes things much worse.  We might end up having 96 cpus * 500 cgroups * 2 (main) LRUs pagevecs,
which is a lot of data structures to be running through all the time.

Hongwei's patch
(https://lore.kernel.org/linux-mm/991b4719-a2a0-9efe-de02-56a928752fe3@xxxxxxxxxx/)
restores the previous method for computing lru_size and is much more efficient in getting the lru_size.
We got a 20% throughput improvement in our database benchmark with Hongwei's patch, and
lruvec_lru_size's cpu overhead completely disappeared from the cpu profile.

We'll like to see Hongwei's patch getting merged.

The problem can also be reproduced by running simple multi-threaded pmbench benchmark
with a fast Optane SSD swap (see profile below).

6.15%     3.08%  pmbench          [kernel.vmlinux]            [k] lruvec_lru_size
            |
            |--3.07%--lruvec_lru_size
            |          |
            |          |--2.11%--cpumask_next
            |          |          |
            |          |           --1.66%--find_next_bit
            |          |
            |           --0.57%--call_function_interrupt
            |                     |
            |                      --0.55%--smp_call_function_interrupt
            |
            |--1.59%--0x441f0fc3d009
            |          _ops_rdtsc_init_base_freq
            |          access_histogram
            |          page_fault
            |          __do_page_fault
            |          handle_mm_fault
            |          __handle_mm_fault
            |          |
            |           --1.54%--do_swap_page
            |                     swapin_readahead
            |                     swap_cluster_readahead
            |                     |
            |                      --1.53%--read_swap_cache_async
            |                                __read_swap_cache_async
            |                                alloc_pages_vma
            |                                __alloc_pages_nodemask
            |                                __alloc_pages_slowpath
            |                                try_to_free_pages
            |                                do_try_to_free_pages
            |                                shrink_node
            |                                shrink_node_memcg
            |                                |
            |                                |--0.77%--lruvec_lru_size
            |                                |
            |                                 --0.76%--inactive_list_is_low
            |                                           |
            |                                            --0.76%--lruvec_lru_size
            |
             --1.50%--measure_read
                       page_fault
                       __do_page_fault
                       handle_mm_fault
                       __handle_mm_fault
                       do_swap_page
                       swapin_readahead
                       swap_cluster_readahead
                       |
                        --1.48%--read_swap_cache_async
                                  __read_swap_cache_async
                                  alloc_pages_vma
                                  __alloc_pages_nodemask
                                  __alloc_pages_slowpath
                                  try_to_free_pages
                                  do_try_to_free_pages
                                  shrink_node
                                  shrink_node_memcg
                                  |
                                  |--0.75%--inactive_list_is_low
                                  |          |
                                  |           --0.75%--lruvec_lru_size
                                  |
                                   --0.73%--lruvec_lru_size

Thanks.

Tim