On 2/14/20 2:45 AM, Michal Hocko wrote:
> On Wed 05-02-20 10:34:57, Tim Chen wrote:
>> Topic: Memory cgroups, whether you like it or not
>>
>> 1. Memory cgroup counters scalability
>>
>> Recently, benchmark teams at Intel were running some bare-metal
>> benchmarks. To our great surprise, we saw lots of memcg activity in
>> the profiles. When we asked the benchmark team, they did not even
>> realize they were using memory cgroups. They were fond of running all
>> their benchmarks in containers that just happened to use memory cgroups
>> by default. What were previously problems only for memory cgroup users
>> are quickly becoming a problem for everyone.
>>
>> There are mem cgroup counters that are read in page management paths
>> which scale poorly when read. These counters are per-cpu based and
>> need to be summed over all CPUs in the lruvec_page_state_local()
>> function to get the overall value for the mem cgroup. This led to
>> scalability problems on systems with large numbers of CPUs. For
>> example, we've seen 14+% of kernel CPU time consumed in
>> snapshot_refaults(). We have also encountered a similar issue
>> recently when computing the lru_size[1].
>>
>> We'd like to do some brainstorming to see if there are ways to make
>> such accounting more scalable. For example, not all usages of such
>> counters need precise counts, and some approximate counts that are
>> updated lazily can be used.
>
> Please make sure to prepare numbers based on the current upstream kernel
> so that we have some grounds to base the discussion on. Ideally post
> them into the email.

Here's a profile on a 5.2-based kernel with some memory tiering
modifications. It shows snapshot_refaults() consuming a big chunk of
cpu cycles gathering the refault stats stored in the
WORKINGSET_ACTIVATE field of the root memcg's lruvec.

We have to read #memcg x #ncpu local counters to get the complete
refault snapshot, so the computation scales poorly as the number of
memcgs and cpus grows. (A simplified sketch of that read path is
appended below for reference.)

We'll be recollecting some of the data on a 5.5 kernel. I will post
those numbers when they become available.

The percentages below are relative to kernel cpu cycles, and kernel
time consumed 31% of total cpu cycles. The MySQL workload ran on a
2-socket system with 24 cores per socket.

    14.22%  mysqld  [kernel.kallsyms]  [k] snapshot_refaults
            |
            ---snapshot_refaults
               do_try_to_free_pages
               try_to_free_pages
               __alloc_pages_slowpath
               __alloc_pages_nodemask
               |
               |--14.07%--alloc_pages_vma
               |          |
               |           --14.06%--__handle_mm_fault
               |                     handle_mm_fault
               |                     |
               |                     |--12.57%--__get_user_pages
               |                     |          get_user_pages_unlocked
               |                     |          get_user_pages_fast
               |                     |          iov_iter_get_pages
               |                     |          do_blockdev_direct_IO
               |                     |          ext4_direct_IO
               |                     |          generic_file_read_iter
               |                     |          |
               |                     |          |--12.16%--new_sync_read
               |                     |          |          vfs_read
               |                     |          |          ksys_pread64
               |                     |          |          do_syscall_64
               |                     |          |          entry_SYSCALL_64_after_hwframe

Thanks.

Tim
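For reference, the read path that dominates the profile above looks
roughly like the sketch below. This is a simplified rendering of the
~5.5 mm/vmscan.c and mm/memcontrol.h code, not the exact upstream
source: the outer loop visits every memcg in the hierarchy, and the
inner loop folds one counter across every possible CPU, which is
where the #memcg x #ncpu cost comes from.

    /* Simplified sketch of the refault snapshot taken after reclaim. */
    static void snapshot_refaults(struct mem_cgroup *root, pg_data_t *pgdat)
    {
            struct mem_cgroup *memcg;

            /* Outer loop: every memcg in the hierarchy under root. */
            memcg = mem_cgroup_iter(root, NULL, NULL);
            do {
                    struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);

                    /* Each call folds one per-cpu counter over all CPUs. */
                    lruvec->refaults = lruvec_page_state_local(lruvec,
                                                    WORKINGSET_ACTIVATE);
            } while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
    }

    /* The fold itself: O(nr_possible_cpus) per counter, per memcg. */
    static unsigned long lruvec_page_state_local(struct lruvec *lruvec,
                                                 enum node_stat_item idx)
    {
            struct mem_cgroup_per_node *pn;
            long x = 0;
            int cpu;

            pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
            for_each_possible_cpu(cpu)
                    x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);

            return x < 0 ? 0 : x;
    }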
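On the "approximate counts that are updated lazily" idea from the
original mail: one existing kernel primitive with exactly that
read/write trade-off is lib/percpu_counter.c. The snippet below is
purely illustrative (the counter name, batch value, and helpers are
made up, this is not a proposed patch); it just shows that a batched
counter lets readers choose between an O(1) approximate value and an
O(nr_cpus) exact sum.

    #include <linux/gfp.h>
    #include <linux/percpu_counter.h>

    /* Hypothetical counter, for illustration only. */
    static struct percpu_counter example_refault_count;

    static int example_counter_init(void)
    {
            /* Allocates the per-cpu slots; the shared count starts at 0. */
            return percpu_counter_init(&example_refault_count, 0, GFP_KERNEL);
    }

    static void example_account(long delta)
    {
            /*
             * The hot path stays per-cpu; the shared count is only
             * updated once the local delta exceeds the batch (64 here).
             */
            percpu_counter_add_batch(&example_refault_count, delta, 64);
    }

    static s64 example_read_fast(void)
    {
            /* O(1), approximate: may be off by up to nr_cpus * batch. */
            return percpu_counter_read(&example_refault_count);
    }

    static s64 example_read_exact(void)
    {
            /* O(nr_cpus): the expensive fold the profile is hitting. */
            return percpu_counter_sum(&example_refault_count);
    }

Readers that can tolerate the bounded error (e.g. heuristics in
reclaim) could use the cheap read, and only the few places that truly
need precision would pay for the full per-cpu sum.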