On Wed, Feb 23, 2022 at 7:57 AM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
>
> +Ivan Babrou
>
> Hi Daniel,
>
> On Wed, Feb 23, 2022 at 5:51 AM Daniel Dao <dqminh@xxxxxxxxxxxxxx> wrote:
> >
> > Hi all,
> >
> > We are observing some regressions in workingset_refault on our newly upgraded
> > 5.15.19 nodes with zram as swap. This manifests in several ways:
> >
> > 1) Regression of workingset_refault duration observed in flamegraphs
> >
> > We regularly collect flamegraphs for running services on the node. Since the
> > upgrade to 5.15.19, we see that workingset_refault occupies a more significant
> > part of the service flamegraph (13%), with the following call trace:
> >
> > workingset_refault+0x128
> > add_to_page_cache_lru+0x9f
> > page_cache_ra_unbounded+0x154
> > force_page_cache_ra+0xe2
> > filemap_get_pages+0xe9
> > filemap_read+0xa4
> > xfs_file_buffered_read+0x98
> > xfs_file_read_iter+0x6a
> > new_sync_read+0x118
> > vfs_read+0xf2
> > __x64_sys_pread64+0x89
> > do_syscall_64+0x3b
> > entry_SYSCALL_64_after_hwframe+0x44
> >
> > 2) Regression of performance-sensitive userspace code
> >
> > We have some performance-sensitive code running in userspace whose runtime is
> > measured with CLOCK_THREAD_CPUTIME_ID. It looks roughly like:
> >
> > now = clock_gettime(CLOCK_THREAD_CPUTIME_ID)
> > func()
> > elapsed = clock_gettime(CLOCK_THREAD_CPUTIME_ID) - now
> >
> > Since the 5.15 upgrade, we have observed long `elapsed` values in the range of
> > 4-10ms much more frequently than before. This went away after we disabled swap
> > for the service with the `memory.swap.max=0` memcg configuration.
> >
> > The new thing in the 5.15 workingset_refault path seems to be the introduction
> > of mem_cgroup_flush_stats() by commit 1f828223b7991a228bc2aef837b78737946d44b2
> > ("memcg: flush lruvec stats in the refault").
> >
> > Given that mem_cgroup_flush_stats() can take quite a long time for us on the
> > standard systemd cgroupv2 hierarchy ( root / system.slice / workload.service ):
> >
> > sudo /usr/share/bcc/tools/funcslower -m 10 -t mem_cgroup_flush_stats
> > Tracing function calls slower than 10 ms... Ctrl+C to quit.
> > TIME       COMM         PID      LAT(ms)  RVAL  FUNC
> > 0.000000   <redacted>   804776     11.50   200  mem_cgroup_flush_stats
> > 0.343383   <redacted>   647496     10.58   200  mem_cgroup_flush_stats
> > 0.604309   <redacted>   804776     10.50   200  mem_cgroup_flush_stats
> > 1.230416   <redacted>   803293     10.01   200  mem_cgroup_flush_stats
> > 1.248442   <redacted>   646400     11.02   200  mem_cgroup_flush_stats
> >
> > could it be possible that workingset_refault in some unfortunate cases takes
> > much longer than before, such that it increases the time observed via
> > CLOCK_THREAD_CPUTIME_ID from userspace, or the overall duration of
> > workingset_refault observed by perf?
>
> Ivan reported this issue last month and since then we have backported
> three patches to 5.15-stable to fix the issue. These got merged into
> 5.15.18. However, you are reporting that the issue is still present in 5.15.19.
>
> Can you share a bit more detail on your hardware configuration (number of
> CPUs) and, if possible, the flamegraph?
>
Also, if you can reproduce the issue, can you try the patch at
https://lore.kernel.org/all/20210929235936.2859271-1-shakeelb@xxxxxxxxxx/ ?
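
For reference, below is a minimal, self-contained sketch of the
CLOCK_THREAD_CPUTIME_ID measurement pattern Daniel describes. The do_work()
function and the 4 ms reporting threshold are hypothetical stand-ins for
illustration, not taken from the actual service:

/*
 * Sketch of measuring per-thread CPU time around a critical section,
 * in the spirit of the pseudocode in the report above.
 * Build with: gcc -O2 -o cputime cputime.c
 */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the performance-sensitive func() in the report. */
static void do_work(void)
{
	volatile unsigned long sink = 0;

	for (unsigned long i = 0; i < 1000000; i++)
		sink += i;
}

/* Nanoseconds between two timespecs. */
static long long ts_delta_ns(const struct timespec *start,
			     const struct timespec *end)
{
	return (end->tv_sec - start->tv_sec) * 1000000000LL +
	       (end->tv_nsec - start->tv_nsec);
}

int main(void)
{
	struct timespec before, after;
	long long elapsed_ns;

	/*
	 * CLOCK_THREAD_CPUTIME_ID counts CPU time consumed by this thread
	 * only: kernel time spent inside the thread's own syscalls and page
	 * faults is included, time spent blocked or preempted is not.
	 */
	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &before);
	do_work();
	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &after);

	elapsed_ns = ts_delta_ns(&before, &after);
	if (elapsed_ns > 4 * 1000000LL)	/* arbitrary 4 ms threshold */
		fprintf(stderr, "slow iteration: %.3f ms of thread CPU time\n",
			elapsed_ns / 1e6);
	else
		printf("elapsed: %.3f ms\n", elapsed_ns / 1e6);

	return 0;
}

Because CLOCK_THREAD_CPUTIME_ID charges kernel time spent in the thread's own
syscalls and faults to that thread, a slow mem_cgroup_flush_stats() call taken
inside workingset_refault() on the read path would be expected to show up
directly in `elapsed`, even though the thread never blocks.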