Here's a typical graph: https://imgur.com/a/n03x5yH

* Green (numa0) and blue (numa1) for 4.19
* Yellow (numa0) and orange (numa1) for 5.4

The downward slopes on numa0 on 5.4 are fairly typical of the worst case
scenario. If I clean up the data a bit from a bunch of machines, this is
how numa0 compares to numa1 in terms of 1h average values of free memory
above 5GiB:

* https://imgur.com/a/6T4rRzi

I think it's safe to say that numa0 is much, much worse, but I cannot be
100% sure that numa1 is free from adverse effects; they may just be
hiding in the noise caused by rolling reboots.
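For context, the per-node free memory and kswapd numbers behind these
graphs can be sampled from the standard sysfs/procfs interfaces, roughly
like this (a rough sketch of the idea, not the exact collection pipeline
behind the graphs above):

$ # free memory per NUMA node
$ grep MemFree /sys/devices/system/node/node*/meminfo
$ # cumulative kswapd scan/steal counters; sample twice to get a rate
$ grep -E 'pgscan_kswapd|pgsteal_kswapd' /proc/vmstat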
On Tue, Feb 11, 2020 at 2:16 AM Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, Feb 07, 2020 at 02:54:43PM -0800, Ivan Babrou wrote:
> > This change from 5.5 times:
> >
> > * https://github.com/torvalds/linux/commit/1c30844d2dfe
> >
> > > mm: reclaim small amounts of memory when an external fragmentation event occurs
> >
> > Introduced undesired effects in our environment.
> >
> > * NUMA with 2 x CPU
> > * 128GB of RAM
> > * THP disabled
> > * Upgraded from 4.19 to 5.4
> >
> > Before, we saw free memory hover at around 1.4GB with no spikes. After
> > the upgrade we saw some machines decide that they need a lot more than
> > that, with frequent spikes above 10GB, often only on a single numa
> > node.
> >
> > We can see kswapd quite active in balance_pgdat (it didn't look like
> > it slept at all):
> >
> > $ ps uax | fgrep kswapd
> > root      1850 23.0  0.0      0     0 ?        R    Jan30 1902:24 [kswapd0]
> > root      1851  1.8  0.0      0     0 ?        S    Jan30  152:16 [kswapd1]
> >
> > This in turn massively increased pressure on the page cache, which did
> > not go well for services that depend on a quick response from a local
> > cache backed by solid storage.
> >
> > Here's what it looked like when I zeroed vm.watermark_boost_factor:
> >
> > * https://imgur.com/a/6IZWicU
> >
> > IO subsided from 100% busy (populating the page cache at 300MB/s on a
> > single SATA drive) down to under 100MB/s.
> >
> > This sort of regression doesn't seem like a good thing.
>
> It is not a good thing, so thanks for the report. Obviously I have not
> seen something similar, or at least not severe enough to show up on my
> radar. I'd seen some increases in reclaim activity affecting benchmarks
> that rely on use-twice data remaining resident, but nothing severe
> enough to warrant action.
>
> Can you tell me if it is *always* node 0 that shows crazy activity? I
> ask because some conditions would have to be met for the boost to always
> apply. It's already a per-zone attribute but it is treated indirectly as
> a pgdat property. What I'm thinking is that on node 0, the DMA32 or DMA
> zone gets boosted but vmscan then reclaims from higher zones until the
> boost is removed. That would excessively reclaim memory but be specific
> to node 0.
>
> I've cc'd Rik as he says he saw something similar even on single node
> systems. The boost applying to lower zones would still affect single
> node systems, but NUMA machines always getting impacted by the boost
> would show that the boost really needs to be a per-node flag. Sure, we
> *could* apply the reclaim to just the lower zones, but that potentially
> means a *lot* of scan activity -- potentially 124G of pages before a
> lower zone page is found on Ivan's machine. That might be the very
> situation being encountered here.
>
> An alternative is that boosting is only ever applied to the highest
> populated zone in a system. The intent of the patch was primarily about
> THP, which can use any zone to reduce allocation latency. While it's
> possible that there are cases where the latency of other orders matters
> *and* they require lower zones, I think it's unlikely, and this would be
> a safer option overall.
>
> However, overall I think the simplest approach is to abort the boosting
> if reclaim is reaching higher priorities without being able to clear the
> boost. The boost is a best-effort attempt to reduce allocation latency
> in the future. This approach still has some overhead as there is a
> reclaim pass, but kswapd will abort and go to sleep if the normal
> watermarks are met.
>
> This is build tested only. Ideally someone on the cc has a test case
> that can reproduce this specific problem of excessive kswapd activity.
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 572fb17c6273..71dd47172cef 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3462,6 +3462,25 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
>  	return false;
>  }
>  
> +static void acct_boosted_reclaim(pg_data_t *pgdat, int classzone_idx,
> +			unsigned long *zone_boosts)
> +{
> +	struct zone *zone;
> +	unsigned long flags;
> +	int i;
> +
> +	for (i = 0; i <= classzone_idx; i++) {
> +		if (!zone_boosts[i])
> +			continue;
> +
> +		/* Increments are under the zone lock */
> +		zone = pgdat->node_zones + i;
> +		spin_lock_irqsave(&zone->lock, flags);
> +		zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
> +		spin_unlock_irqrestore(&zone->lock, flags);
> +	}
> +}
> +
>  /* Clear pgdat state for congested, dirty or under writeback. */
>  static void clear_pgdat_congested(pg_data_t *pgdat)
>  {
> @@ -3654,9 +3673,17 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  		if (!nr_boost_reclaim && balanced)
>  			goto out;
>  
> -		/* Limit the priority of boosting to avoid reclaim writeback */
> -		if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
> -			raise_priority = false;
> +		/*
> +		 * Abort boosting if reclaiming at higher priority is not
> +		 * working to avoid excessive reclaim due to lower zones
> +		 * being boosted.
> +		 */
> +		if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2) {
> +			acct_boosted_reclaim(pgdat, classzone_idx, zone_boosts);
> +			boosted = false;
> +			nr_boost_reclaim = 0;
> +			goto restart;
> +		}
>  
>  		/*
>  		 * Do not writeback or swap pages for boosted reclaim. The
> @@ -3738,18 +3765,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  out:
>  	/* If reclaim was boosted, account for the reclaim done in this pass */
>  	if (boosted) {
> -		unsigned long flags;
> -
> -		for (i = 0; i <= classzone_idx; i++) {
> -			if (!zone_boosts[i])
> -				continue;
> -
> -			/* Increments are under the zone lock */
> -			zone = pgdat->node_zones + i;
> -			spin_lock_irqsave(&zone->lock, flags);
> -			zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
> -			spin_unlock_irqrestore(&zone->lock, flags);
> -		}
> +		acct_boosted_reclaim(pgdat, classzone_idx, zone_boosts);
>  
>  		/*
>  		 * As there is now likely space, wakeup kcompactd to defragment
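For completeness, the workaround mentioned above (zeroing
vm.watermark_boost_factor) amounts to something like the following sketch.
The sysctl defaults to 15000 (boost up to 150% of the high watermark per
the vm sysctl documentation) and setting it to 0 should disable the boost;
the sysctl.d file name below is only an example:

$ # disable watermark boosting at runtime
$ sysctl -w vm.watermark_boost_factor=0
$ # persist across reboots
$ echo 'vm.watermark_boost_factor = 0' | sudo tee /etc/sysctl.d/99-watermark-boost.conf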