On Wed, Feb 23, 2022 at 4:00 PM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
>
> Can you share a bit more detail on your hardware configuration (num of
> cpus) and if possible the flamegraph?

We have a mix of 96 and 128 cpus. I'm not yet sure whether we can share
the flamegraphs; we may have to come back to that later if necessary.

> Also if you can reproduce the issue, can you try the patch at
> https://lore.kernel.org/all/20210929235936.2859271-1-shakeelb@xxxxxxxxxx/
> ?

We can give it a try.

I also wrote a bpftrace script to capture the kernel stack whenever
mem_cgroup_flush_stats() takes longer than 10ms:

kprobe:mem_cgroup_flush_stats
{
	@start[tid] = nsecs;
	@stack[tid] = kstack;
}

kretprobe:mem_cgroup_flush_stats
/@start[tid]/
{
	$usecs = (nsecs - @start[tid]) / 1000;
	// 10000 us == 10 ms threshold
	if ($usecs >= 10000) {
		printf("mem_cgroup_flush_stats: %d us\n", $usecs);
		printf("stack: %s\n", @stack[tid]);
	}
	delete(@start[tid]);
	delete(@stack[tid]);
}

END
{
	clear(@start);
	clear(@stack);
}

Running it on a production node yields output like:

mem_cgroup_flush_stats: 10697 us
stack:
	mem_cgroup_flush_stats+1
	workingset_refault+296
	add_to_page_cache_lru+159
	page_cache_ra_unbounded+340
	force_page_cache_ra+226
	filemap_get_pages+233
	filemap_read+164
	xfs_file_buffered_read+152
	xfs_file_read_iter+106
	new_sync_read+277
	vfs_read+242
	__x64_sys_pread64+137
	do_syscall_64+56
	entry_SYSCALL_64_after_hwframe+68

As the stack shows, the flush sits directly on the buffered read path
(pread64 -> filemap_read -> workingset_refault), so I think adding many
milliseconds to workingset_refault() is too high.
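In case anyone wants to reproduce this: the script above is
self-contained, so saving it to a file (flush.bt is just an example
name) and running it as root should be enough:

	bpftrace flush.bt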
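To quantify the impact beyond the 10ms outliers, here is a sketch of a
one-liner that histograms workingset_refault() latency directly; it
uses the same kprobe/kretprobe pairing as the script above and nothing
in it is specific to our setup:

	bpftrace -e 'kprobe:workingset_refault { @start[tid] = nsecs; }
	kretprobe:workingset_refault /@start[tid]/ {
		@usecs = hist((nsecs - @start[tid]) / 1000);
		delete(@start[tid]);
	}'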