On Fri, 7 Feb 2020 14:54:43 -0800 Ivan Babrou wrote:
> This change from 5.5 times:
>
> * https://github.com/torvalds/linux/commit/1c30844d2dfe
>
> > mm: reclaim small amounts of memory when an external fragmentation event occurs
>
> Introduced undesired effects in our environment.
>
> * NUMA with 2 x CPU
> * 128GB of RAM
> * THP disabled
> * Upgraded from 4.19 to 5.4
>
> Before we saw free memory hover at around 1.4GB with no spikes. After
> the upgrade we saw some machines decide that they need a lot more than
> that, with frequent spikes above 10GB, often only on a single numa
> node.
>
> We can see kswapd quite active in balance_pgdat (it didn't look like
> it slept at all):
>
> $ ps uax | fgrep kswapd
> root 1850 23.0 0.0 0 0 ? R Jan30 1902:24 [kswapd0]
> root 1851 1.8 0.0 0 0 ? S Jan30 152:16 [kswapd1]
>
> This in turn massively increased pressure on page cache, which did not
> go well to services that depend on having a quick response from a
> local cache backed by solid storage.
>
> Here's how it looked like when I zeroed vm.watermark_boost_factor:
>
> * https://imgur.com/a/6IZWicU
>
> IO subsided from 100% busy in page cache population at 300MB/s on a
> single SATA drive down to under 100MB/s.
>
> This sort of regression doesn't seem like a good thing.

Here are two small diffs :P

[1] cleanup: stop reclaiming pages once balanced.

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3641,6 +3641,9 @@ restart:
 		 * re-evaluate if boosting is required when kswapd next wakes.
 		 */
 		balanced = pgdat_balanced(pgdat, sc.order, classzone_idx);
+		if (balanced)
+			break;
+
 		if (!balanced && nr_boost_reclaim) {
 			nr_boost_reclaim = 0;
 			goto restart;
--

[2] restore the old behavior by ignoring boost before falling in hot water.

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3572,7 +3572,7 @@ static int balance_pgdat(pg_data_t *pgda
 	unsigned long pflags;
 	unsigned long nr_boost_reclaim;
 	unsigned long zone_boosts[MAX_NR_ZONES] = { 0, };
-	bool boosted;
+	bool boosted = false;
 	struct zone *zone;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
@@ -3591,18 +3591,22 @@ static int balance_pgdat(pg_data_t *pgda
 	 * place so that parallel allocations that are near the watermark will
 	 * stall or direct reclaim until kswapd is finished.
 	 */
+restart:
 	nr_boost_reclaim = 0;
 	for (i = 0; i <= classzone_idx; i++) {
 		zone = pgdat->node_zones + i;
 		if (!managed_zone(zone))
 			continue;
 
+		if (boosted) {
+			zone->watermark_boost = 0;
+			continue;
+		}
 		nr_boost_reclaim += zone->watermark_boost;
 		zone_boosts[i] = zone->watermark_boost;
 	}
 	boosted = nr_boost_reclaim;
 
-restart:
 	sc.priority = DEF_PRIORITY;
 	do {
 		unsigned long nr_reclaimed = sc.nr_reclaimed;
@@ -3644,10 +3648,9 @@ restart:
 		if (balanced)
 			break;
 
-		if (!balanced && nr_boost_reclaim) {
-			nr_boost_reclaim = 0;
+		/* Limit the priority of boosting to avoid reclaim writeback */
+		if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
 			goto restart;
-		}
 
 		/*
 		 * If boosting is not active then only reclaim if there are no
--
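
For comparison runs without a rebuild, the workaround from the quoted report (zeroing vm.watermark_boost_factor) can be applied from userspace. What follows is only a minimal sketch, not part of either diff. It assumes the sysctl is exposed at /proc/sys/vm/watermark_boost_factor and that the caller has permission to write it. The helper names write_sysctl and read_sysctl and the WATERMARK_BOOST_PATH macro are made up for illustration.

```c
#include <stdio.h>

/* Assumed location of the knob; adjust if procfs is mounted elsewhere. */
#define WATERMARK_BOOST_PATH "/proc/sys/vm/watermark_boost_factor"

/* Hypothetical helper: write one integer to a sysctl-style file.
 * Returns 0 on success, -1 on any open/write/close failure. */
static int write_sysctl(const char *path, long value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%ld\n", value) > 0;
	/* Buffered write errors may only surface at close time. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Hypothetical helper: read one integer back from the same kind of file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
static int read_sysctl(const char *path, long *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%ld", value) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

Calling write_sysctl(WATERMARK_BOOST_PATH, 0) as root should have the same effect as "sysctl -w vm.watermark_boost_factor=0". Reading the value first with read_sysctl makes it possible to restore the previous setting after the test.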