v5:
 - Apply the following changes to patch 3:
   1) Make cgroup_name() write directly into kbuf without using an
      intermediate buffer.
   2) Change the terminology from "offline memcg" to "dying memcg" to
      align better with similar terms used elsewhere in the kernel.

v4:
 - Take rcu_read_lock() when the memcg is being accessed, as suggested
   by Michal.
 - Make print_page_owner_memcg() return the new offset into the buffer
   and put the CONFIG_MEMCG block inside, as suggested by Mike.
 - Directly use TASK_COMM_LEN as the length of the name buffer, as
   suggested by Roman.

v3:
 - Add unlikely() to patch 1 and clarify that -1 will not be returned.
 - Use a helper function to print out memcg information in patch 3.
 - Add a new patch 4 to store the task command name in the page_owner
   structure.

While debugging a constant increase in percpu memory consumption on a
system that spawned a large number of containers, it was found that
many dying mem_cgroup structures remained in place without being
freed. Further investigation indicated that those mem_cgroup
structures were pinned by some pages. In order to find out what those
pages are, the existing page_owner debugging tool is extended to show
memory cgroup information and whether those memcgs are dying or not.
With the enhanced page_owner tool, the following is a typical page
that pinned the mem_cgroup structure in my test case:

Page allocated via order 0, mask 0x1100cca(GFP_HIGHUSER_MOVABLE),
 pid 70984 (podman), ts 5421278969115 ns, free_ts 5420935666638 ns
PFN 3205061 type Movable Block 6259 type Movable
Flags 0x17ffffc00c001c(uptodate|dirty|lru|reclaim|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
 prep_new_page+0x8e/0xb0
 get_page_from_freelist+0xc4d/0xe50
 __alloc_pages+0x172/0x320
 alloc_pages_vma+0x84/0x230
 shmem_alloc_page+0x3f/0x90
 shmem_alloc_and_acct_page+0x76/0x1c0
 shmem_getpage_gfp+0x48d/0x890
 shmem_write_begin+0x36/0xc0
 generic_perform_write+0xed/0x1d0
 __generic_file_write_iter+0xdc/0x1b0
 generic_file_write_iter+0x5d/0xb0
 new_sync_write+0x11f/0x1b0
 vfs_write+0x1ba/0x2a0
 ksys_write+0x59/0xd0
 do_syscall_64+0x37/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
Charged to dying memcg libpod-conmon-fbc62060b5377479a7371cc16c5c596002945f2aa00d3d6d73a0cd0d148b6637.scope

So the page was not freed because it was part of a shmem segment. That
is useful information that can help users diagnose similar problems.

With cgroup v1, /proc/cgroups can be read to find out the total number
of memory cgroups (online + dying). With cgroup v2, the cgroup.stat
file of the root cgroup can be read to find the number of dying
cgroups (most likely pinned by dying memcgs).

The page_owner feature is not supposed to be enabled on a production
system due to its memory overhead. However, if it is suspected that
dying memcgs are increasing over time, a test environment with
page_owner enabled can then be set up with an appropriate workload for
further analysis of what may be causing the increasing number of dying
memcgs.
Waiman Long (4):
  lib/vsprintf: Avoid redundant work with 0 size
  mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check
  mm/page_owner: Print memcg information
  mm/page_owner: Record task command name

 lib/vsprintf.c  |  8 +++---
 mm/page_owner.c | 72 ++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 62 insertions(+), 18 deletions(-)

-- 
2.27.0