> Can you please first confirm if you are running 5.15.19 kernel? If
> that is the case then it means the machine is seeing (nr_cpus * 32)
> stat updates.

Yes, we're on 5.15.19.

> Couple of questions:
> 1) Did you see any call stack other than xfs read where expensive
> mem_cgroup_flush_stats() is showing up?

Yes:

mem_cgroup_flush_stats: 7717 us
stack:
        mem_cgroup_flush_stats+1
        workingset_refault+296
        __read_swap_cache_async+512
        swapin_readahead+817
        do_swap_page+561
        __handle_mm_fault+2318
        handle_mm_fault+191
        do_user_addr_fault+428
        exc_page_fault+103
        asm_exc_page_fault+30
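In case a distribution is more useful than individual slow samples: a kprobe/kretprobe pair along the lines below collects a per-call latency histogram for mem_cgroup_flush_stats (it shows up as a real symbol in the stacks above, so it should be probeable). This is just a sketch, and the 1 ms cutoff for the extra slow-flush counter is an arbitrary value, not something taken from the traces:

bpftrace -e '
kprobe:mem_cgroup_flush_stats {
  // remember when this thread entered the flush
  @start[tid] = nsecs;
}
kretprobe:mem_cgroup_flush_stats /@start[tid]/ {
  $us = (nsecs - @start[tid]) / 1000;
  // per-call flush latency histogram, in microseconds
  @flush_us = hist($us);
  // separately count flushes that took 1 ms or longer
  if ($us >= 1000) { @slow_flushes = count(); }
  delete(@start[tid]);
}'

Under sustained load this makes it easy to tell whether multi-millisecond flushes like the 7717 us one above are rare outliers or the common case.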
> 2) Can you please use the similar bpf+kprobe tracing for the
> memcg_rstat_updated() (or __mod_memcg_lruvec_state()) to find the
> source of frequent stat updates.

memcg_rstat_updated() is static inline, so it can't be traced with a kprobe directly. With the following:

bpftrace -e 'kprobe:__mod_memcg_lruvec_state { @stacks[kstack(10)]++ }'

The top stacks are:

@stacks[
    __mod_memcg_lruvec_state+1
    __mod_lruvec_page_state+96
    clear_page_dirty_for_io+292
    write_cache_pages+323
    iomap_writepages+28
    xfs_vm_writepages+113
    do_writepages+195
    __writeback_single_inode+55
    writeback_sb_inodes+493
    __writeback_inodes_wb+76
]: 36429
@stacks[
    __mod_memcg_lruvec_state+1
    __mod_lruvec_page_state+96
    __test_set_page_writeback+421
    iomap_do_writepage+1241
    write_cache_pages+393
    iomap_writepages+28
    xfs_vm_writepages+113
    do_writepages+195
    __writeback_single_inode+55
    writeback_sb_inodes+493
]: 36431
@stacks[
    __mod_memcg_lruvec_state+1
    mod_objcg_state+421
    memcg_slab_post_alloc_hook+345
    __kmalloc_node_track_caller+386
    __alloc_skb+139
    alloc_skb_with_frags+75
    sock_alloc_send_pskb+495
    unix_stream_sendmsg+599
    sock_sendmsg+94
    sock_write_iter+151
]: 39459
@stacks[
    __mod_memcg_lruvec_state+1
    release_pages+760
    tlb_finish_mmu+89
    zap_page_range+265
    do_madvise.part.0+1844
    __x64_sys_madvise+69
    do_syscall_64+59
    entry_SYSCALL_64_after_hwframe+68
]: 54648
@stacks[
    __mod_memcg_lruvec_state+1
    __mod_lruvec_page_state+96
    page_remove_rmap+85
    unmap_page_range+1413
    zap_page_range+224
    do_madvise.part.0+1844
    __x64_sys_madvise+69
    do_syscall_64+59
    entry_SYSCALL_64_after_hwframe+68
]: 54719
@stacks[
    __mod_memcg_lruvec_state+1
    __pagevec_lru_add+256
    lru_cache_add+82
    __handle_mm_fault+4686
    handle_mm_fault+191
    do_user_addr_fault+428
    exc_page_fault+103
    asm_exc_page_fault+30
]: 98103
@stacks[
    __mod_memcg_lruvec_state+1
    __mod_lruvec_page_state+96
    page_add_new_anon_rmap+78
    __handle_mm_fault+4675
    handle_mm_fault+191
    do_user_addr_fault+428
    exc_page_fault+103
    asm_exc_page_fault+30
]: 105071

> 3) I am still pondering why disabling swap resolves the issue for you.
> Is that only for a workload different from xfs read?

My understanding is that any block IO (including swap) triggers the new memcg accounting code. In this process we don't have any IO other than swap, so disabling swap removes the major (if not the only) vector for triggering the issue.

> 4) Can you please also trace cgroup_rstat_flush_locked()?

Stacks are:

bpftrace -e 'kprobe:cgroup_rstat_flush_locked { @stacks[kstack(10)]++ }'
Attaching 1 probe...
^C

@stacks[
    cgroup_rstat_flush_locked+1
    cgroup_rstat_flush_irqsafe+36
    __mem_cgroup_flush_stats+51
    shrink_node+163
    balance_pgdat+777
    kswapd+508
    kthread+295
    ret_from_fork+34
]: 1
@stacks[
    cgroup_rstat_flush_locked+1
    cgroup_rstat_flush_irqsafe+36
    __mem_cgroup_flush_stats+51
    workingset_refault+296
    __read_swap_cache_async+512
    swapin_readahead+817
    do_swap_page+561
    __handle_mm_fault+2318
    handle_mm_fault+191
    do_user_addr_fault+428
]: 5
@stacks[
    cgroup_rstat_flush_locked+1
    cgroup_rstat_flush_irqsafe+36
    __mem_cgroup_flush_stats+51
    workingset_refault+296
    add_to_page_cache_lru+159
    page_cache_ra_unbounded+340
    filemap_get_pages+569
    filemap_read+164
    xfs_file_buffered_read+152
    xfs_file_read_iter+106
]: 18
@stacks[
    cgroup_rstat_flush_locked+1
    cgroup_rstat_flush_irqsafe+36
    __mem_cgroup_flush_stats+51
    workingset_refault+296
    add_to_page_cache_lru+159
    page_cache_ra_unbounded+340
    filemap_get_pages+233
    filemap_read+164
    xfs_file_buffered_read+152
    xfs_file_read_iter+106
]: 31
@stacks[
    cgroup_rstat_flush_locked+1
    cgroup_rstat_flush_irqsafe+36
    __mem_cgroup_flush_stats+51
    workingset_refault+296
    add_to_page_cache_lru+159
    page_cache_ra_unbounded+340
    force_page_cache_ra+226
    filemap_get_pages+233
    filemap_read+164
    xfs_file_buffered_read+152
]: 223

Let me know if you only need the slow ones.

I should mention that there are really two issues:

1. Expensive workingset_refault that shows up on flamegraphs. We see it for our RocksDB-based database, which persists data on xfs (local nvme).
2. Expensive workingset_refault that causes latency hiccups but doesn't show up on flamegraphs. We see it in our nginx-based proxy with swap enabled (either zram or a regular swap file on xfs on local nvme). We solved the latter by disabling swap.

I think the proper solution would be for workingset_refault to be fast enough to be invisible, in line with how it behaved on Linux 5.10.