Re: Regression in workingset_refault latency on 5.15

Daniel Dao <dqminh@xxxxxxxxxxxxxx> · Thu, 24 Feb 2022 17:34:10 +0000

On Thu, Feb 24, 2022 at 4:58 PM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
>
> On Thu, Feb 24, 2022 at 02:46:27PM +0000, Daniel Dao wrote:
>
> [...]
>
>
> > 3) Summary of stack traces when mem_cgroup_flush_stats is over 5ms
>
>
> Can you please check if flush_memcg_stats_dwork() appears in any stack
> traces at all?

Here is the result of probes on flush_memcg_stats_dwork:

$ sudo /usr/share/bcc/tools/funccount -d 30 flush_memcg_stats_dwork
Tracing 1 functions for "b'flush_memcg_stats_dwork'"... Hit Ctrl-C to end.

FUNC                                    COUNT
b'flush_memcg_stats_dwork'                 14

$ sudo /usr/share/bcc/tools/funclatency -d 30 flush_memcg_stats_dwork
Tracing 1 functions for "flush_memcg_stats_dwork"... Hit Ctrl-C to end.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 8        |****************************************|
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 0        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 0        |                                        |
   1048576 -> 2097151    : 1        |*****                                   |
   2097152 -> 4194303    : 4        |********************                    |
   4194304 -> 8388607    : 2        |**********                              |

avg = 1725693 nsecs, total: 25885397 nsecs, count: 15

So the async flush is triggered as expected, roughly every 2 seconds, and most
of those runs are faster than the inline call from workingset_refault. I think
that on busy servers with varied workloads that touch swap/page cache, most of
the cost is very likely in the inline mem_cgroup_flush_stats() call in
workingset_refault rather than in the async flush.
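
As a rough sanity check on the numbers above, here is a small sketch that
recomputes the average and the approximate flush interval. The bucket counts,
total and call count are copied from the funccount/funclatency output; the
variable names and the ~1 ms cut-off are only for illustration and are not
part of the bcc tools.

```python
# Numbers copied from the funccount/funclatency runs above.
WINDOW_SECS = 30        # -d 30
funccount_calls = 14    # COUNT reported by funccount

# Non-empty log2 buckets from funclatency: lower bound in ns -> count.
buckets = {4096: 8, 1048576: 1, 2097152: 4, 4194304: 2}
total_ns = 25885397     # "total" reported by funclatency

count = sum(buckets.values())
avg_ns = total_ns // count
interval_secs = WINDOW_SECS / funccount_calls

# Illustrative cut-off: runs that landed in the ~1 ms bucket or above.
slow = sum(c for lo, c in buckets.items() if lo >= 1048576)

print(f"calls in histogram: {count}")
print(f"avg latency: {avg_ns} ns")
print(f"approx. interval between flushes: {interval_secs:.2f} s")
print(f"runs in the ~1 ms bucket or above: {slow} of {count}")
```

This agrees with the reported avg and with a flush roughly every 2 seconds;
the remaining runs all fall in the 4-8 usec bucket.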

> Thanks for testing. At the moment I am suspecting the async worker is
> not getting the CPU. Can you share your CONFIG_HZ setting? Also can you
> try the following patch and see if that helps otherwise keep halving the
> delay (i.e. 2HZ -> HZ -> HZ/2 -> ...) and find at what value the issue
> you are seeing get resolved?

We have CONFIG_HZ=1000. We can try increasing the frequency of the async flush,
but that seems like a poor band-aid. Is it possible to remove
mem_cgroup_flush_stats() from workingset_refault, or at least scope it down to
a targeted cgroup so that we don't need to flush from root, with a potentially
large set of cgroups to walk?
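
For reference, this is what the suggested halving sequence (2HZ -> HZ -> HZ/2
-> ...) works out to with our CONFIG_HZ=1000. It is only a sketch of the
arithmetic: the helper name and the HZ/8 stopping point are arbitrary choices
for illustration, not something taken from the patch.

```python
CONFIG_HZ = 1000  # our kernel config, as stated above

def jiffies_to_ms(jiffies, hz=CONFIG_HZ):
    """Convert a delay in jiffies to milliseconds for a given HZ."""
    return jiffies * 1000 / hz

# Start at the current 2*HZ delay and keep halving it.
# Stopping at HZ/8 is an arbitrary cut-off for this illustration.
delay = 2 * CONFIG_HZ
steps = []
while delay >= CONFIG_HZ // 8:
    steps.append((delay, jiffies_to_ms(delay)))
    delay //= 2

for jiffies, ms in steps:
    print(f"{jiffies:5d} jiffies = {ms:7.1f} ms between async flushes")
```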