On Wed, Feb 23, 2022 at 7:57 AM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
>
> +Ivan Babrou
>
> Hi Daniel,
>
> On Wed, Feb 23, 2022 at 5:51 AM Daniel Dao <dqminh@xxxxxxxxxxxxxx> wrote:
> >
> > Hi all,
> >
> > We are observing some regressions in workingset_refault on our newly upgraded
> > 5.15.19 nodes with zram as swap. This manifests in several ways:
> >
> > 1) Regression of workingset_refault duration observed in flamegraphs
> >
> > We regularly collect flamegraphs for running services on the node. Since the
> > upgrade to 5.15.19, we see that workingset_refault occupies a more significant
> > part of the service flamegraph (13%), with the following call trace:
> >
> > workingset_refault+0x128
> > add_to_page_cache_lru+0x9f
> > page_cache_ra_unbounded+0x154
> > force_page_cache_ra+0xe2
> > filemap_get_pages+0xe9
> > filemap_read+0xa4
> > xfs_file_buffered_read+0x98
> > xfs_file_read_iter+0x6a
> > new_sync_read+0x118
> > vfs_read+0xf2
> > __x64_sys_pread64+0x89
> > do_syscall_64+0x3b
> > entry_SYSCALL_64_after_hwframe+0x44
> >
> > 2) Regression of performance-sensitive userspace code
> >
> > We have some performance-sensitive code running in userspace whose runtime is
> > measured with CLOCK_THREAD_CPUTIME_ID. It looks roughly like:
> >
> > now = clock_gettime(CLOCK_THREAD_CPUTIME_ID)
> > func()
> > elapsed = clock_gettime(CLOCK_THREAD_CPUTIME_ID) - now
> >
> > Since the 5.15 upgrade, we have observed long `elapsed` values in the range of
> > 4-10ms much more frequently than before. This went away after we disabled swap
> > for the service with the `memory.swap.max=0` memcg configuration.
> >
> > The new thing in the 5.15 workingset_refault path seems to be the introduction
> > of mem_cgroup_flush_stats() by commit 1f828223b7991a228bc2aef837b78737946d44b2
> > ("memcg: flush lruvec stats in the refault").
> >
> > Given that mem_cgroup_flush_stats() can take quite a long time for us on the
> > standard systemd cgroupv2 hierarchy ( root / system.slice / workload.service ):
> >
> > sudo /usr/share/bcc/tools/funcslower -m 10 -t mem_cgroup_flush_stats
> > Tracing function calls slower than 10 ms... Ctrl+C to quit.
> > TIME       COMM         PID      LAT(ms)  RVAL  FUNC
> > 0.000000   <redacted>   804776     11.50   200  mem_cgroup_flush_stats
> > 0.343383   <redacted>   647496     10.58   200  mem_cgroup_flush_stats
> > 0.604309   <redacted>   804776     10.50   200  mem_cgroup_flush_stats
> > 1.230416   <redacted>   803293     10.01   200  mem_cgroup_flush_stats
> > 1.248442   <redacted>   646400     11.02   200  mem_cgroup_flush_stats
> >
> > could it be possible that workingset_refault in some unfortunate cases takes
> > much longer than before, such that it increases the time observed via
> > CLOCK_THREAD_CPUTIME_ID from userspace, or the overall duration of
> > workingset_refault observed by perf?
>
> Ivan reported this issue last month and since then we have backported
> three patches to 5.15-stable to fix the issue. These got merged into
> 5.15.18. However, you are reporting that the issue is still present in 5.15.19.
>
> Can you share a bit more detail on your hardware configuration (number of
> CPUs) and, if possible, the flamegraph?
>
Also, if you can reproduce the issue, can you try the patch at
https://lore.kernel.org/all/20210929235936.2859271-1-shakeelb@xxxxxxxxxx/ ?
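
For reference, below is a minimal, self-contained sketch of the
CLOCK_THREAD_CPUTIME_ID measurement pattern Daniel describes. The do_work()
function and the 4 ms reporting threshold are hypothetical stand-ins for
illustration, not taken from the actual service:

/*
 * Sketch of measuring per-thread CPU time around a critical section,
 * in the spirit of the pseudocode in the report above.
 * Build with: gcc -O2 -o cputime cputime.c
 */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the performance-sensitive func() in the report. */
static void do_work(void)
{
	volatile unsigned long sink = 0;

	for (unsigned long i = 0; i < 1000000; i++)
		sink += i;
}

/* Nanoseconds between two timespecs. */
static long long ts_delta_ns(const struct timespec *start,
			     const struct timespec *end)
{
	return (end->tv_sec - start->tv_sec) * 1000000000LL +
	       (end->tv_nsec - start->tv_nsec);
}

int main(void)
{
	struct timespec before, after;
	long long elapsed_ns;

	/*
	 * CLOCK_THREAD_CPUTIME_ID counts CPU time consumed by this thread
	 * only: kernel time spent inside the thread's own syscalls and page
	 * faults is included, time spent blocked or preempted is not.
	 */
	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &before);
	do_work();
	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &after);

	elapsed_ns = ts_delta_ns(&before, &after);
	if (elapsed_ns > 4 * 1000000LL)	/* arbitrary 4 ms threshold */
		fprintf(stderr, "slow iteration: %.3f ms of thread CPU time\n",
			elapsed_ns / 1e6);
	else
		printf("elapsed: %.3f ms\n", elapsed_ns / 1e6);

	return 0;
}

Because CLOCK_THREAD_CPUTIME_ID charges kernel time spent in the thread's own
syscalls and faults to that thread, a slow mem_cgroup_flush_stats() call taken
inside workingset_refault() on the read path would be expected to show up
directly in `elapsed`, even though the thread never blocks.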