Hi all,

We are observing some regressions in workingset_refault on our newly upgraded 5.15.19 nodes with zram as swap. This manifests in several ways:

1) Regression of workingset_refault duration observed in flamegraphs

We regularly collect flamegraphs for services running on the node. Since the upgrade to 5.15.19, workingset_refault occupies a more significant part of the service flamegraph (13%), with the following call trace:

workingset_refault+0x128
add_to_page_cache_lru+0x9f
page_cache_ra_unbounded+0x154
force_page_cache_ra+0xe2
filemap_get_pages+0xe9
filemap_read+0xa4
xfs_file_buffered_read+0x98
xfs_file_read_iter+0x6a
new_sync_read+0x118
vfs_read+0xf2
__x64_sys_pread64+0x89
do_syscall_64+0x3b
entry_SYSCALL_64_after_hwframe+0x44

2) Regression of performance-sensitive userspace code

We have some performance-sensitive code running in userspace whose runtime is measured with CLOCK_THREAD_CPUTIME_ID. It looks roughly like:

now = clock_gettime(CLOCK_THREAD_CPUTIME_ID)
func()
elapsed = clock_gettime(CLOCK_THREAD_CPUTIME_ID) - now

Since the 5.15 upgrade, we observe long `elapsed` values in the 4-10ms range much more frequently than before. This went away after we disabled swap for the service using the `memory.swap.max=0` memcg configuration.

The new thing in the 5.15 workingset_refault path seems to be the introduction of mem_cgroup_flush_stats() by commit 1f828223b7991a228bc2aef837b78737946d44b2 ("memcg: flush lruvec stats in the refault"). Given that mem_cgroup_flush_stats() can take quite a long time for us on the standard systemd cgroupv2 hierarchy (root / system.slice / workload.service):

sudo /usr/share/bcc/tools/funcslower -m 10 -t mem_cgroup_flush_stats
Tracing function calls slower than 10 ms... Ctrl+C to quit.
TIME       COMM          PID      LAT(ms)  RVAL  FUNC
0.000000   <redacted>    804776     11.50   200   mem_cgroup_flush_stats
0.343383   <redacted>    647496     10.58   200   mem_cgroup_flush_stats
0.604309   <redacted>    804776     10.50   200   mem_cgroup_flush_stats
1.230416   <redacted>    803293     10.01   200   mem_cgroup_flush_stats
1.248442   <redacted>    646400     11.02   200   mem_cgroup_flush_stats

could it be possible that workingset_refault in some unfortunate case takes much longer than before, such that it increases the time observed by CLOCK_THREAD_CPUTIME_ID from userspace, or the overall duration of workingset_refault observed by perf?
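
For completeness, here is a minimal compilable sketch of the measurement pattern above; func() is just a placeholder for the real performance-sensitive work, not our actual service code:

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* placeholder for the actual performance-sensitive work */
static void func(void)
{
}

int main(void)
{
    struct timespec start, end;
    long long elapsed_ns;

    /* CLOCK_THREAD_CPUTIME_ID counts CPU time consumed by this
     * thread only, so time spent blocked should not be included */
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);
    func();
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &end);

    elapsed_ns = (end.tv_sec - start.tv_sec) * 1000000000LL +
                 (end.tv_nsec - start.tv_nsec);

    /* outliers in the 4-10ms range are what we see on 5.15 */
    printf("elapsed: %lld ns\n", elapsed_ns);
    return 0;
}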