Re: Reclaim regression after 1c30844d2dfe

Hillf Danton <hdanton@xxxxxxxx> · Sat, 8 Feb 2020 19:11:33 +0800

On Fri, 7 Feb 2020 14:54:43 -0800 Ivan Babrou wrote:
> This change from 5.0 times:
> 
> * https://github.com/torvalds/linux/commit/1c30844d2dfe
> 
> > mm: reclaim small amounts of memory when an external fragmentation event occurs
> 
> Introduced undesired effects in our environment.
> 
> * NUMA with 2 x CPU
> * 128GB of RAM
> * THP disabled
> * Upgraded from 4.19 to 5.4
> 
> Before the upgrade, free memory hovered at around 1.4GB with no spikes.
> After the upgrade, some machines decided that they need a lot more than
> that, with frequent spikes above 10GB, often on only a single NUMA
> node.
> 
> We can see kswapd quite active in balance_pgdat (it didn't look like
> it slept at all):
> 
> $ ps uax | fgrep kswapd
> root       1850 23.0  0.0      0     0 ?        R    Jan30 1902:24 [kswapd0]
> root       1851  1.8  0.0      0     0 ?        S    Jan30 152:16 [kswapd1]
> 
> This in turn massively increased pressure on the page cache, which did
> not go well for services that depend on a quick response from a
> local cache backed by solid storage.
> 
> Here's what it looked like when I zeroed vm.watermark_boost_factor:
> 
> * https://imgur.com/a/6IZWicU
> 
> IO subsided from 100% busy in page cache population at 300MB/s on a
> single SATA drive down to under 100MB/s.
> 
> This sort of regression doesn't seem like a good thing.
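
For context on why zeroing vm.watermark_boost_factor helps: as I read
boost_watermark() in mm/page_alloc.c, each fragmentation event raises
the zone's boost by one pageblock, capped at factor/10000 of the high
watermark (the default factor is 15000, i.e. 150%), and a factor of 0
disables boosting. Below is a rough user-space sketch of that
arithmetic only. It is not kernel code; toy_max_boost() and
toy_boost_step() are hypothetical names, and the kernel's mult_frac()
rounding is simplified.

```c
/*
 * Toy model of the watermark boost ceiling. All names here are
 * hypothetical; values are in pages.
 */

/* Ceiling for the boost: high_wmark * factor / 10000, at least one pageblock. */
static unsigned long toy_max_boost(unsigned long high_wmark,
				   unsigned long boost_factor,
				   unsigned long pageblock_pages)
{
	unsigned long long scaled;

	/* A zero factor means boosting is switched off entirely. */
	if (boost_factor == 0)
		return 0;

	scaled = (unsigned long long)high_wmark * boost_factor / 10000;
	return scaled > pageblock_pages ? (unsigned long)scaled
					: pageblock_pages;
}

/* One fragmentation event: add a pageblock, but never exceed the ceiling. */
static unsigned long toy_boost_step(unsigned long cur_boost,
				    unsigned long max_boost,
				    unsigned long pageblock_pages)
{
	unsigned long next = cur_boost + pageblock_pages;

	return next < max_boost ? next : max_boost;
}
```

With the default factor the boost can grow to 1.5x the high watermark,
so kswapd may be asked to free well over twice what it targeted before
the change. With the factor at 0 the ceiling is 0 and the target is
the plain high watermark again.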

Here are two small diffs :P

[1] cleanup: stop reclaiming pages once balanced.

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3641,6 +3641,9 @@ restart:
 		 * re-evaluate if boosting is required when kswapd next wakes.
 		 */
 		balanced = pgdat_balanced(pgdat, sc.order, classzone_idx);
+		if (balanced)
+			break;
+
 		if (!balanced && nr_boost_reclaim) {
 			nr_boost_reclaim = 0;
 			goto restart;
--

[2] restore the old behavior: drop the boost and restart once priority
    falls to DEF_PRIORITY - 2, before reclaim lands in hot water.

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3572,7 +3572,7 @@ static int balance_pgdat(pg_data_t *pgda
 	unsigned long pflags;
 	unsigned long nr_boost_reclaim;
 	unsigned long zone_boosts[MAX_NR_ZONES] = { 0, };
-	bool boosted;
+	bool boosted = false;
 	struct zone *zone;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
@@ -3591,18 +3591,22 @@ static int balance_pgdat(pg_data_t *pgda
 	 * place so that parallel allocations that are near the watermark will
 	 * stall or direct reclaim until kswapd is finished.
 	 */
+restart:
 	nr_boost_reclaim = 0;
 	for (i = 0; i <= classzone_idx; i++) {
 		zone = pgdat->node_zones + i;
 		if (!managed_zone(zone))
 			continue;
 
+		if (boosted) {
+			zone->watermark_boost = 0;
+			continue;
+		}
 		nr_boost_reclaim += zone->watermark_boost;
 		zone_boosts[i] = zone->watermark_boost;
 	}
 	boosted = nr_boost_reclaim;
 
-restart:
 	sc.priority = DEF_PRIORITY;
 	do {
 		unsigned long nr_reclaimed = sc.nr_reclaimed;
@@ -3644,10 +3648,9 @@ restart:
 		if (balanced)
 			break;
 
-		if (!balanced && nr_boost_reclaim) {
-			nr_boost_reclaim = 0;
+		/* Limit the priority of boosting to avoid reclaim writeback */
+		if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
 			goto restart;
-		}
 
 		/*
 		 * If boosting is not active then only reclaim if there are no
--
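
To make the intended control flow of the two diffs easier to follow,
here is a small user-space toy model. It is only a sketch under loud
assumptions: a single zone, a fixed number of pages reclaimed per pass,
and priority dropping on every pass (the kernel only lowers it under
certain conditions). struct toy_node, toy_balance() and friends are
made-up names, not kernel interfaces.

```c
#include <stdbool.h>

#define DEF_PRIORITY 12

/* Hypothetical single-zone node; all values are in pages. */
struct toy_node {
	unsigned long free_pages;
	unsigned long high_wmark;
	unsigned long boost;		/* stands in for zone->watermark_boost */
	unsigned long reclaim_per_pass;	/* assumed fixed reclaim progress */
};

struct toy_result {
	int passes;		/* reclaim passes performed */
	bool boost_dropped;	/* true if the boost was zeroed on restart */
	bool balanced;		/* true if the loop ended balanced */
};

/* Balanced means free pages reach the (possibly boosted) high watermark. */
static bool toy_balanced(const struct toy_node *n)
{
	return n->free_pages >= n->high_wmark + n->boost;
}

static struct toy_result toy_balance(struct toy_node *n)
{
	struct toy_result r = { 0, false, false };
	unsigned long nr_boost_reclaim;
	bool boosted = false;
	int priority;

restart:
	nr_boost_reclaim = 0;
	if (boosted) {
		/* Diff [2]: on restart, forget the boost entirely. */
		n->boost = 0;
		r.boost_dropped = true;
	} else {
		nr_boost_reclaim = n->boost;
	}
	boosted = nr_boost_reclaim != 0;

	priority = DEF_PRIORITY;
	do {
		n->free_pages += n->reclaim_per_pass;
		r.passes++;

		/* Diff [1]: stop reclaiming as soon as we are balanced. */
		if (toy_balanced(n)) {
			r.balanced = true;
			break;
		}

		/* Diff [2]: give up on boosting before priority gets low. */
		if (nr_boost_reclaim && priority == DEF_PRIORITY - 2)
			goto restart;

		priority--;
	} while (priority >= 1);

	return r;
}
```

In this model a node with a large boost reclaims for three passes, then
restarts with the boost cleared and balances against the plain high
watermark. A node with no boost, or with a boost small enough to be met
within those passes, never takes the restart path.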