+Ivan Babrou

Hi Daniel,

On Wed, Feb 23, 2022 at 5:51 AM Daniel Dao <dqminh@xxxxxxxxxxxxxx> wrote:
>
> Hi all,
>
> We are observing some regressions in workingset_refault on our newly upgraded
> 5.15.19 nodes with zram as swap. This manifests in several ways:
>
> 1) Regression of workingset_refault duration observed in flamegraphs
>
> We regularly collect flamegraphs for running services on the node. Since the
> upgrade to 5.15.19, we have seen workingset_refault occupy a more significant
> part of the service flamegraph (13%), with the following call trace:
>
>   workingset_refault+0x128
>   add_to_page_cache_lru+0x9f
>   page_cache_ra_unbounded+0x154
>   force_page_cache_ra+0xe2
>   filemap_get_pages+0xe9
>   filemap_read+0xa4
>   xfs_file_buffered_read+0x98
>   xfs_file_read_iter+0x6a
>   new_sync_read+0x118
>   vfs_read+0xf2
>   __x64_sys_pread64+0x89
>   do_syscall_64+0x3b
>   entry_SYSCALL_64_after_hwframe+0x44
>
> 2) Regression of performance-sensitive userspace code
>
> We have some performance-sensitive code running in userspace whose runtime is
> measured with CLOCK_THREAD_CPUTIME_ID. It looks roughly like:
>
>   now = clock_gettime(CLOCK_THREAD_CPUTIME_ID)
>   func()
>   elapsed = clock_gettime(CLOCK_THREAD_CPUTIME_ID) - now
>
> Since the 5.15 upgrade, we have observed long `elapsed` values in the range of
> 4-10ms much more frequently than before. This went away after we disabled swap
> for the service with the `memory.swap.max=0` memcg configuration.
>
> The new thing in the 5.15 workingset_refault seems to be the introduction of
> mem_cgroup_flush_stats() by commit 1f828223b7991a228bc2aef837b78737946d44b2
> ("memcg: flush lruvec stats in the refault").
>
> Given that mem_cgroup_flush_stats() can take quite a long time for us on the
> standard systemd cgroupv2 hierarchy (root / system.slice / workload.service):
>
>   sudo /usr/share/bcc/tools/funcslower -m 10 -t mem_cgroup_flush_stats
>   Tracing function calls slower than 10 ms... Ctrl+C to quit.
>   TIME      COMM          PID      LAT(ms)  RVAL  FUNC
>   0.000000  <redacted>    804776   11.50    200   mem_cgroup_flush_stats
>   0.343383  <redacted>    647496   10.58    200   mem_cgroup_flush_stats
>   0.604309  <redacted>    804776   10.50    200   mem_cgroup_flush_stats
>   1.230416  <redacted>    803293   10.01    200   mem_cgroup_flush_stats
>   1.248442  <redacted>    646400   11.02    200   mem_cgroup_flush_stats
>
> could it be possible that workingset_refault() in some unfortunate case takes
> much longer than before, such that it increases the time observed by
> CLOCK_THREAD_CPUTIME_ID from userspace, or the overall duration of
> workingset_refault() observed by perf?

Ivan reported this issue last month, and since then we have backported three
patches to 5.15-stable to fix it. Those were merged into 5.15.18, yet you are
reporting that the issue is still present in 5.15.19. Can you share a bit more
detail on your hardware configuration (number of CPUs) and, if possible, the
flamegraph?

thanks,
Shakeel