+Ivan Babrou

Hi Daniel,

On Wed, Feb 23, 2022 at 5:51 AM Daniel Dao <dqminh@xxxxxxxxxxxxxx> wrote:
>
> Hi all,
>
> We are observing some regressions in workingset_refault on our newly upgraded
> 5.15.19 nodes with zram as swap. This manifests in several ways:
>
> 1) Regression of workingset_refault duration observed in flamegraphs
>
> We regularly collect flamegraphs for running services on the node. Since the
> upgrade to 5.15.19, we have seen workingset_refault occupy a more significant
> part of the service flamegraph (13%), with the following call trace:
>
>   workingset_refault+0x128
>   add_to_page_cache_lru+0x9f
>   page_cache_ra_unbounded+0x154
>   force_page_cache_ra+0xe2
>   filemap_get_pages+0xe9
>   filemap_read+0xa4
>   xfs_file_buffered_read+0x98
>   xfs_file_read_iter+0x6a
>   new_sync_read+0x118
>   vfs_read+0xf2
>   __x64_sys_pread64+0x89
>   do_syscall_64+0x3b
>   entry_SYSCALL_64_after_hwframe+0x44
>
> 2) Regression of performance-sensitive userspace code
>
> We have some performance-sensitive code running in userspace whose runtime is
> measured with CLOCK_THREAD_CPUTIME_ID. It looks roughly like:
>
>   now = clock_gettime(CLOCK_THREAD_CPUTIME_ID)
>   func()
>   elapsed = clock_gettime(CLOCK_THREAD_CPUTIME_ID) - now
>
> Since the 5.15 upgrade, we have observed long `elapsed` values in the range of
> 4-10ms much more frequently than before. This went away after we disabled swap
> for the service with the `memory.swap.max=0` memcg configuration.
>
> The new thing in the 5.15 workingset_refault seems to be the introduction of
> mem_cgroup_flush_stats() by commit 1f828223b7991a228bc2aef837b78737946d44b2
> ("memcg: flush lruvec stats in the refault").
>
> Given that mem_cgroup_flush_stats() can take quite a long time for us on the
> standard systemd cgroupv2 hierarchy (root / system.slice / workload.service):
>
>   sudo /usr/share/bcc/tools/funcslower -m 10 -t mem_cgroup_flush_stats
>   Tracing function calls slower than 10 ms... Ctrl+C to quit.
>   TIME      COMM          PID      LAT(ms)  RVAL  FUNC
>   0.000000  <redacted>    804776   11.50    200   mem_cgroup_flush_stats
>   0.343383  <redacted>    647496   10.58    200   mem_cgroup_flush_stats
>   0.604309  <redacted>    804776   10.50    200   mem_cgroup_flush_stats
>   1.230416  <redacted>    803293   10.01    200   mem_cgroup_flush_stats
>   1.248442  <redacted>    646400   11.02    200   mem_cgroup_flush_stats
>
> could it be possible that workingset_refault() in some unfortunate case takes
> much longer than before, such that it increases the time observed by
> CLOCK_THREAD_CPUTIME_ID from userspace, or the overall duration of
> workingset_refault() observed by perf?

Ivan reported this issue last month, and since then we have backported three
patches to 5.15-stable to fix it. Those were merged into 5.15.18, yet you are
reporting that the issue is still present in 5.15.19. Can you share a bit more
detail on your hardware configuration (number of CPUs) and, if possible, the
flamegraph?

thanks,
Shakeel