On 2/14/20 2:45 AM, Michal Hocko wrote:
> On Wed 05-02-20 10:34:57, Tim Chen wrote:
>> Topic: Memory cgroups, whether you like it or not
>>
>> 1. Memory cgroup counters scalability
>>
>> Recently, benchmark teams at Intel were running some bare-metal
>> benchmarks. To our great surprise, we saw lots of memcg activity in
>> the profiles. When we asked the benchmark team, they did not even
>> realize they were using memory cgroups. They were fond of running all
>> their benchmarks in containers that just happened to use memory cgroups
>> by default. What were previously problems only for memory cgroup users
>> are quickly becoming a problem for everyone.
>>
>> There are mem cgroup counters that are read in page management paths
>> which scale poorly when read. These counters are per-cpu based and
>> need to be summed over all CPUs in the lruvec_page_state_local()
>> function to get the overall value for the mem cgroup. This led to
>> scalability problems on systems with large numbers of CPUs. For
>> example, we've seen 14+% of kernel CPU time consumed in
>> snapshot_refaults(). We have also encountered a similar issue
>> recently when computing the lru_size[1].
>>
>> We'd like to do some brainstorming to see if there are ways to make
>> such accounting more scalable. For example, not all usages of such
>> counters need precise counts, and some approximate counts that are
>> updated lazily can be used.
>
> Please make sure to prepare numbers based on the current upstream kernel
> so that we have some grounds to base the discussion on. Ideally post
> them into the email.

Here's a profile on a 5.2-based kernel with some memory tiering
modifications. It shows snapshot_refaults() consuming a big chunk of
cpu cycles gathering the refault stats stored in the
WORKINGSET_ACTIVATE field of the root memcg's lruvec.

We have to read #memcg x #ncpu local counters to get the complete
refault snapshot, so the computation scales poorly as the number of
memcgs and cpus grows. (A simplified sketch of that read path is
appended below for reference.)

We'll be recollecting some of the data on a 5.5 kernel. I will post
those numbers when they become available.

The percentages below are relative to kernel cpu cycles, and kernel
time consumed 31% of total cpu cycles. The MySQL workload ran on a
2-socket system with 24 cores per socket.

    14.22%  mysqld  [kernel.kallsyms]  [k] snapshot_refaults
            |
            ---snapshot_refaults
               do_try_to_free_pages
               try_to_free_pages
               __alloc_pages_slowpath
               __alloc_pages_nodemask
               |
               |--14.07%--alloc_pages_vma
               |          |
               |           --14.06%--__handle_mm_fault
               |                     handle_mm_fault
               |                     |
               |                     |--12.57%--__get_user_pages
               |                     |          get_user_pages_unlocked
               |                     |          get_user_pages_fast
               |                     |          iov_iter_get_pages
               |                     |          do_blockdev_direct_IO
               |                     |          ext4_direct_IO
               |                     |          generic_file_read_iter
               |                     |          |
               |                     |          |--12.16%--new_sync_read
               |                     |          |          vfs_read
               |                     |          |          ksys_pread64
               |                     |          |          do_syscall_64
               |                     |          |          entry_SYSCALL_64_after_hwframe

Thanks.

Tim
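For reference, the read path that dominates the profile above looks
roughly like the sketch below. This is a simplified rendering of the
~5.5 mm/vmscan.c and mm/memcontrol.h code, not the exact upstream
source: the outer loop visits every memcg in the hierarchy, and the
inner loop folds one counter across every possible CPU, which is
where the #memcg x #ncpu cost comes from.

    /* Simplified sketch of the refault snapshot taken after reclaim. */
    static void snapshot_refaults(struct mem_cgroup *root, pg_data_t *pgdat)
    {
            struct mem_cgroup *memcg;

            /* Outer loop: every memcg in the hierarchy under root. */
            memcg = mem_cgroup_iter(root, NULL, NULL);
            do {
                    struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);

                    /* Each call folds one per-cpu counter over all CPUs. */
                    lruvec->refaults = lruvec_page_state_local(lruvec,
                                                    WORKINGSET_ACTIVATE);
            } while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
    }

    /* The fold itself: O(nr_possible_cpus) per counter, per memcg. */
    static unsigned long lruvec_page_state_local(struct lruvec *lruvec,
                                                 enum node_stat_item idx)
    {
            struct mem_cgroup_per_node *pn;
            long x = 0;
            int cpu;

            pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
            for_each_possible_cpu(cpu)
                    x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);

            return x < 0 ? 0 : x;
    }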
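On the "approximate counts that are updated lazily" idea from the
original mail: one existing kernel primitive with exactly that
read/write trade-off is lib/percpu_counter.c. The snippet below is
purely illustrative (the counter name, batch value, and helpers are
made up, this is not a proposed patch); it just shows that a batched
counter lets readers choose between an O(1) approximate value and an
O(nr_cpus) exact sum.

    #include <linux/gfp.h>
    #include <linux/percpu_counter.h>

    /* Hypothetical counter, for illustration only. */
    static struct percpu_counter example_refault_count;

    static int example_counter_init(void)
    {
            /* Allocates the per-cpu slots; the shared count starts at 0. */
            return percpu_counter_init(&example_refault_count, 0, GFP_KERNEL);
    }

    static void example_account(long delta)
    {
            /*
             * The hot path stays per-cpu; the shared count is only
             * updated once the local delta exceeds the batch (64 here).
             */
            percpu_counter_add_batch(&example_refault_count, delta, 64);
    }

    static s64 example_read_fast(void)
    {
            /* O(1), approximate: may be off by up to nr_cpus * batch. */
            return percpu_counter_read(&example_refault_count);
    }

    static s64 example_read_exact(void)
    {
            /* O(nr_cpus): the expensive fold the profile is hitting. */
            return percpu_counter_sum(&example_refault_count);
    }

Readers that can tolerate the bounded error (e.g. heuristics in
reclaim) could use the cheap read, and only the few places that truly
need precision would pay for the full per-cpu sum.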