Re: [PATCH] mm:vmscan: fix inaccurate reclaim during proactive reclaim

Johannes Weiner <hannes@xxxxxxxxxxx> · Tue, 11 Jul 2023 11:28:10 -0400

On Fri, Jul 07, 2023 at 06:32:26PM +0800, Efly Young wrote:
> With commit f53af4285d77 ("mm: vmscan: fix extreme overreclaim
> and swap floods"), proactive reclaim still seems inaccurate.
> 
> Our problematic scene also are almost anon pages. Request 1G
> by writing memory.reclaim will reclaim 1.7G or other values
> more than 1G by swapping.
> 
> This try to fix the inaccurate reclaim problem.

I can see how this happens. Direct and kswapd reclaim have much
smaller nr_to_reclaim targets, so it's less noticable when we loop a
few times. Proactive reclaim can come in with a rather large value.

What does the reproducer setup look like? Are you calling reclaim on a
higher level cgroup with several children? Or is the looping coming
from having multiple zones alone?

> Signed-off-by: Efly Young <yangyifei03@xxxxxxxxxxxx>
> ---
>  mm/vmscan.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9c1c5e8b..2aea8d9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6208,7 +6208,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  	unsigned long nr_to_scan;
>  	enum lru_list lru;
>  	unsigned long nr_reclaimed = 0;
> -	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> +	unsigned long nr_to_reclaim = (sc->nr_to_reclaim - sc->nr_reclaimed);

This can underflow. shrink_list() eats SWAP_CLUSTER_MAX batches out of
lru_pages >> priority, and only checks reclaimed > to_reclaim
after. This will then disable the bailout mechanism entirely.

In general, I'm not sure this is the best spot to fix the problem:

- During reclaim/compaction, should_continue_reclaim() may decide that
  more reclaim is required before compaction can proceed. But the
  second cycle might not do anything now, since you remember the work
  done by the previous one.

- shrink_node_memcgs() might do the full batch against the first
  cgroup and not touch the second one anymore. This will result in
  super lopsided behavior when you target a tree of multiple groups.

There might be other spots that break, I haven't checked.

You could go through them one by one, of course. But the truth is,
larger reclaim targets are the rare exception. Trying to support them
at the risk of breaking all other reclaim users seems ill-advised.

A better approach might be to just say: "don't call reclaim with large
numbers". Have proactive reclaim code handle the batching into smaller
chunks:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e8ca4bdcb03c..4b016806dcc7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6696,7 +6696,7 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 			lru_add_drain_all();
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
-						nr_to_reclaim - nr_reclaimed,
+						min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
 						GFP_KERNEL, reclaim_options);
 
 		if (!reclaimed && !nr_retries--)