On Thu, Feb 24, 2022 at 4:58 PM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
>
> On Thu, Feb 24, 2022 at 02:46:27PM +0000, Daniel Dao wrote:
> > [...]
> >
> > 3) Summary of stack traces when mem_cgroup_flush_stats is over 5ms
> >
> Can you please check if flush_memcg_stats_dwork() appears in any stack
> traces at all?

Here is the result of probes on flush_memcg_stats_dwork:

$ sudo /usr/share/bcc/tools/funccount -d 30 flush_memcg_stats_dwork
Tracing 1 functions for "b'flush_memcg_stats_dwork'"... Hit Ctrl-C to end.

FUNC                                    COUNT
b'flush_memcg_stats_dwork'                 14

$ sudo /usr/share/bcc/tools/funclatency -d 30 flush_memcg_stats_dwork
Tracing 1 functions for "flush_memcg_stats_dwork"... Hit Ctrl-C to end.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 8        |****************************************|
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 0        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 0        |                                        |
   1048576 -> 2097151    : 1        |*****                                   |
   2097152 -> 4194303    : 4        |********************                    |
   4194304 -> 8388607    : 2        |**********                              |

avg = 1725693 nsecs, total: 25885397 nsecs, count: 15

So we triggered the async flush as expected, around every 2 seconds. But
they mostly run faster than the inline call from workingset_refault. I
think on busy servers with varied workloads that touch swap/page_cache,
it's very likely that most of the cost is in inline
mem_cgroup_flush_stats() of workingset_refault rather than from async
flush.

> Thanks for testing. At the moment I am suspecting the async worker is
> not getting the CPU. Can you share your CONFIG_HZ setting? Also can you
> try the following patch and see if that helps otherwise keep halving the
> delay (i.e. 2HZ -> HZ -> HZ/2 -> ...) and find at what value the issue
> you are seeing get resolved?

We have CONFIG_HZ=1000. We can try to increase the frequency of async
flush, but that seems like a not great bandaid. Is it possible to remove
mem_cgroup_flush_stats() from workingset_refault, or at least scope it
down to some targeted cgroup so we don't need to flush from root with
potentially large sets of cgroups to walk?
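
For anyone who wants to cross-check the numbers above, here is a minimal Python sketch that recomputes the flush interval and the latency split from the funccount/funclatency output. The constants are copied from the traces in this mail; the function names (flush_interval_secs, summarize) are made up for illustration and are not part of bcc or the kernel.

```python
# Values copied from the funccount/funclatency output in this mail.
WINDOW_SECS = 30        # -d 30 passed to both tools
FUNCCOUNT_CALLS = 14    # calls reported by funccount
TOTAL_NS = 25885397     # "total" reported by funclatency

# Non-empty histogram buckets as (low_ns, high_ns, count).
BUCKETS = [
    (4096, 8191, 8),
    (1048576, 2097151, 1),
    (2097152, 4194303, 4),
    (4194304, 8388607, 2),
]


def flush_interval_secs(window_secs, calls):
    """Average time between async flushes over the tracing window."""
    return window_secs / calls


def summarize(buckets, total_ns):
    """Split calls into those entirely below 1 ms and the rest."""
    count = sum(c for _, _, c in buckets)
    sub_ms = sum(c for _, high, c in buckets if high < 1_000_000)
    return {
        "count": count,
        "avg_ns": total_ns // count,
        "sub_ms": sub_ms,
        "ms_or_more": count - sub_ms,
    }


if __name__ == "__main__":
    print(f"interval: {flush_interval_secs(WINDOW_SECS, FUNCCOUNT_CALLS):.2f} s")
    print(summarize(BUCKETS, TOTAL_NS))
```

Run as is, this gives an interval of a little over 2 seconds between flushes, with 8 of the 15 sampled calls finishing in under 1 ms and the other 7 landing between roughly 1 ms and 8.4 ms.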