>> With commit f53af4285d77 ("mm: vmscan: fix extreme overreclaim
>> and swap floods"), proactive reclaim still seems inaccurate.
>>
>> Our problematic scene also are almost anon pages. Request 1G
>> by writing memory.reclaim will reclaim 1.7G or other values
>> more than 1G by swapping.
>>
>> This try to fix the inaccurate reclaim problem.
>
> I can see how this happens. Direct and kswapd reclaim have much
> smaller nr_to_reclaim targets, so it's less noticable when we loop a
> few times. Proactive reclaim can come in with a rather large value.
>
> What does the reproducer setup look like? Are you calling reclaim on a
> higher level cgroup with several children? Or is the looping coming
> from having multiple zones alone?

Thank you for your comment. The reproducer is a single process in a
leaf cgroup without children: it mallocs 20G of anonymous memory and
sleeps, and reclaim is then requested against that leaf cgroup via
memory.reclaim. Before commit f53af4285d77 ("mm: vmscan: fix extreme
overreclaim and swap floods"), the reclaimer could reclaim many times
the requested amount. Now it should eventually reclaim in
[request, 2 * request).

>> Signed-off-by: Efly Young <yangyifei03@xxxxxxxxxxxx>
>> ---
>>  mm/vmscan.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 9c1c5e8b..2aea8d9 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -6208,7 +6208,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>>  	unsigned long nr_to_scan;
>>  	enum lru_list lru;
>>  	unsigned long nr_reclaimed = 0;
>> -	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
>> +	unsigned long nr_to_reclaim = (sc->nr_to_reclaim - sc->nr_reclaimed);
>
> This can underflow. shrink_list() eats SWAP_CLUSTER_MAX batches out of
> lru_pages >> priority, and only checks reclaimed > to_reclaim
> after. This will then disable the bailout mechanism entirely.
>
> In general, I'm not sure this is the best spot to fix the problem:
>
> - During reclaim/compaction, should_continue_reclaim() may decide that
>   more reclaim is required before compaction can proceed. But the
>   second cycle might not do anything now, since you remember the work
>   done by the previous one.
>
> - shrink_node_memcgs() might do the full batch against the first
>   cgroup and not touch the second one anymore. This will result in
>   super lopsided behavior when you target a tree of multiple groups.
>
> There might be other spots that break, I haven't checked.
>
> You could go through them one by one, of course. But the truth is,
> larger reclaim targets are the rare exception. Trying to support them
> at the risk of breaking all other reclaim users seems ill-advised.

I agree with your view. Your explanations cover cases I had not
considered. Thank you again for helping me out.

> A better approach might be to just say: "don't call reclaim with large
> numbers". Have proactive reclaim code handle the batching into smaller
> chunks:
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e8ca4bdcb03c..4b016806dcc7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6696,7 +6696,7 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>  		lru_add_drain_all();
>
>  		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> -					nr_to_reclaim - nr_reclaimed,
> +					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
>  					GFP_KERNEL, reclaim_options);
>
>  		if (!reclaimed && !nr_retries--)

Maybe this way could solve the inaccurate proactive reclaim problem
without breaking the original balance. But might it be less efficient
than before, since each try_to_free_mem_cgroup_pages() call would now
reclaim at most SWAP_CLUSTER_MAX pages and a large request would need
many more loop iterations?
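
If the extra iterations turn out to matter, one variation would be to
batch in a somewhat larger chunk. Below is a rough, untested sketch of
the memory_reclaim() loop with such a cap; MEMCG_PROACTIVE_RECLAIM_BATCH
is a made-up name (not an existing kernel constant), and the retry logic
around the call is only paraphrased from the hunk you quoted:

	/* Hypothetical batch size, a small multiple of SWAP_CLUSTER_MAX. */
	#define MEMCG_PROACTIVE_RECLAIM_BATCH	(4 * SWAP_CLUSTER_MAX)

	while (nr_reclaimed < nr_to_reclaim) {
		unsigned long reclaimed;

		lru_add_drain_all();

		/*
		 * Cap each call so shrink_lruvec() works against a small
		 * nr_to_reclaim and cannot overshoot the request by much.
		 */
		reclaimed = try_to_free_mem_cgroup_pages(memcg,
					min(nr_to_reclaim - nr_reclaimed,
					    (unsigned long)MEMCG_PROACTIVE_RECLAIM_BATCH),
					GFP_KERNEL, reclaim_options);

		if (!reclaimed && !nr_retries--)
			return -EAGAIN;

		nr_reclaimed += reclaimed;
	}

Whether a bigger chunk is actually needed, or SWAP_CLUSTER_MAX alone is
already fine in practice, probably needs measuring with the 1G reclaim
case above.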