workingset transition detection corner case

Vlastimil Babka <vbabka@xxxxxxx> · Fri, 13 Dec 2019 16:38:38 +0100

Hi Johannes,

we have been debugging an issue reported against our 4.12-based kernel,
where a DB-based workload would start thrashing badly at some point,
making the system unusable. This didn't happen when replacing the kernel
with an older 4.4-based one (and keeping everything else the same).

Unfortunately we don't have the reproducer in-house and the conditions
might also be configuration-specific (rootfs is on NFS), but we provided
vmstat monitoring instructions and later tracing. From the data we got,
we found that the workload at some point fills almost the whole
memory with anonymous pages (namely shmem), pushing almost the whole
page cache out, and filling part of the swap. The 4.4-based kernel then
recovers quickly without excessive anon swapping, which suggests the
shmem pages stop being frequently accessed. However, the 4.12-based
kernel is unable to recover and grow the page cache back (both active
and inactive) and keeps thrashing on it.

We have considered the large upstream changes between 4.4 and 4.12 which
include memcg awareness (but there's a single memcg and disabling memcg
makes no difference) and node-based reclaim (there's no
disproportionally sized zone). Then we suspected 4.12 commit
2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache workingset
transition") and how it affects inactive_list_is_low() when called from
shrink_list() - the theory was that we decide to shrink the file active
list too much (by setting inactive_ratio=0) due to refault detection, which
in turn means we shrink file pages too much. This was confirmed by
removing the inactive_ratio=0 part, after which the 4.12-based kernel
stopped thrashing with the workload.
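
For reference, the check in question can be modelled roughly as below. This
is a simplified userspace sketch, not the kernel code: struct lruvec_model,
isqrt() and the hardcoded page shift of 12 are assumptions made for
illustration, and memcg and zone handling is left out.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the kernel's struct lruvec. */
struct lruvec_model {
	unsigned long refaults;	/* snapshot from the end of the last reclaim cycle */
};

/* Integer square root by linear search; enough for the small inputs here. */
static unsigned long isqrt(unsigned long x)
{
	unsigned long r = 0;

	while ((r + 1) * (r + 1) <= x)
		r++;
	return r;
}

/*
 * Model of the decision: list sizes are in pages, refaults_now is the
 * current value of the refault counter.
 */
static bool inactive_is_low_model(const struct lruvec_model *lruvec,
				  unsigned long inactive, unsigned long active,
				  unsigned long refaults_now,
				  bool file, bool actual_reclaim)
{
	unsigned long inactive_ratio;

	if (file && actual_reclaim && lruvec->refaults != refaults_now) {
		/* Refaults seen since the snapshot: drop active list protection. */
		inactive_ratio = 0;
	} else {
		/* Assumes 4K pages (page shift of 12). */
		unsigned long gb = (inactive + active) >> (30 - 12);

		inactive_ratio = gb ? isqrt(10 * gb) : 1;
	}

	return inactive * inactive_ratio < active;
}
```

With inactive_ratio=0 the function returns true whenever the active list is
non-empty, regardless of how large the inactive list is, which is why the
active file list keeps being shrunk.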

Then we investigated what leads to the main condition of the logic -
"lruvec->refaults != refaults", by adding some more tracing to
inactive_list_is_low() and snapshot_refaults(). We suspected bad
interactions due to multiple direct reclaimers, but what I mostly see is
the following pattern of kswapd activity:

- kswapd finishes balancing, makes a snapshot of lruvec->refaults
- after a while (can be up to a few seconds) kswapd is woken up again and
the number of refaults has meanwhile changed by some relatively small
number (tens or hundreds) since the snapshot, so the condition
"lruvec->refaults != refaults" becomes true.
- inactive_list_is_low() keeps being called as part of kswapd operation,
and the condition is always true as the snapshot didn't change. During that
time, the refaults counter is either unchanged or changes only by a few
refaults. Thus, the whole kswapd activity on the file lru is focused on
the active lru.
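
The pattern above can be sketched as a small userspace model. This is only
an illustration under simplifying assumptions: struct kswapd_run_model and
count_true_checks() are hypothetical names, and the snapshot is refreshed
only once, at the end of the run.

```c
/* Hypothetical model of one kswapd run; not kernel code. */
struct kswapd_run_model {
	unsigned long snapshot;	/* value saved when the previous run finished */
	unsigned long refaults;	/* current value of the refault counter */
};

/*
 * Simulate `calls` evaluations of "snapshot != refaults" during one run.
 * After each check the counter grows by `new_per_call`. The snapshot is
 * refreshed only when the run ends. Returns how many checks were true.
 */
static unsigned int count_true_checks(struct kswapd_run_model *run,
				      unsigned int calls,
				      unsigned long new_per_call)
{
	unsigned int hits = 0;

	for (unsigned int i = 0; i < calls; i++) {
		if (run->snapshot != run->refaults)
			hits++;
		run->refaults += new_per_call;
	}

	/* Models taking a new snapshot once balancing is finished. */
	run->snapshot = run->refaults;
	return hits;
}
```

Starting from a snapshot of 1000 and a counter of 1001, every one of e.g. 50
checks in the run is true even if no further refault happens (new_per_call
of 0); only the next run, after the snapshot is refreshed, sees the condition
as false.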

Since the intention of commit 2a2e48854d70 is to detect workingset
transitions, it seems to me it's not working well in this case, as
there's no such transition - the workload just cannot keep its page
cache working set in memory, because it's excessively reclaimed instead
of anonymous memory. The '!=' condition is perhaps too coarse and static
and doesn't reflect how many refaults there were or if refaults keep
happening during kswapd operation - a single refault between two kswapd
runs can affect the whole second run. I wonder if there shouldn't be at
least some kind of decay - when the condition triggers, update the
snapshot to a value between the old snapshot and current value, so if
refaults do not keep occurring, after some number of calls the condition
will stop being true? What do you think?
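
A minimal sketch of what I mean by decay, as a userspace model. Moving the
snapshot exactly halfway is just one possible choice of "a value between",
and struct refault_snapshot and refaults_seen_with_decay() are hypothetical
names, not existing kernel interfaces.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the snapshot kept per lruvec. */
struct refault_snapshot {
	unsigned long refaults;
};

/*
 * Returns true if refaults were observed since the snapshot, and in that
 * case moves the snapshot halfway towards the current value. The step is
 * rounded up so that a difference of 1 also decays to 0.
 * Assumes the counter never drops below the snapshot.
 */
static bool refaults_seen_with_decay(struct refault_snapshot *snap,
				     unsigned long refaults_now)
{
	unsigned long diff = refaults_now - snap->refaults;

	if (!diff)
		return false;

	/* diff - diff / 2 equals diff / 2 rounded up, without overflow. */
	snap->refaults += diff - diff / 2;
	return true;
}
```

With this scheme a difference of 100 refaults would keep the condition true
for 7 consecutive calls if no new refaults arrive, and a single refault for
just one call, instead of for the whole kswapd run.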

I should also mention that we don't have the relatively recent commit
2c012a4ad1a2 ("mm: vmscan: scan anonymous pages on file refaults") in
the 4.12-based kernel. In theory it could also make the problem go away,
as the "excessively true" condition would now also be considered when
inactive_list_is_low() is called from get_scan_count() (in v5.4; I know
there were big reorganizations in the last merge window), and perhaps change
some SCAN_FILE outcomes to SCAN_FRACT. But I think it would be better to
do something with the root cause first.