On 2/26/25 7:54 AM, Michal Hocko wrote:
> On Tue 25-02-25 22:22:58, Gabriel Krisman Bertazi wrote:
>> Commit 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's
>> ->lowmem_reserve[] for empty zone") removes the protection of lower
>> zones from allocations targeting memory-less high zones. This had an
>> unintended impact on the pattern of reclaims because it makes the
>> high-zone-targeted allocation more likely to succeed in lower zones,
>> which adds pressure to said zones. I.e., the following corresponding
>> checks in zone_watermark_ok/zone_watermark_fast are less likely to
>> trigger:
>>
>>     if (free_pages <= min + z->lowmem_reserve[highest_zoneidx])
>>         return false;
>>
>> As a result, we are observing an increase in reclaim and kswapd scans,
>> due to the increased pressure. This was initially observed as increased
>> latency in filesystem operations when benchmarking with fio on a machine
>> with some memory-less zones, but it has since been associated with
>> increased contention in locks related to memory reclaim. By reverting
>> this patch, the original performance was recovered on that machine.
>
> I think it would be nice to show the memory layout on that machine (is
> there any movable or device zone)?
>
> Exact reclaim patterns are really hard to predict, and it is a little
> bit surprising that the said patch has caused increased kswapd activity,
> because I would expect that there would be more reclaim with the lowmem
> reserves in place. But it is quite possible that the higher zone memory
> pressure is just tipping over and increasing the lowmem pressure enough
> that it shows up.

My theory is that the commit caused a difference between kernel and
userspace allocations, with bad consequences. A kernel allocation will
have highest_zoneidx = NORMAL and thus observe the lowmem_reserve for
ZONE_DMA32 unchanged.
A userspace allocation will have highest_zoneidx = MOVABLE and thus will
see a zero lowmem_reserve and will allocate from ZONE_DMA32 (or even
ZONE_DMA) when previously it wouldn't. Then a kernel allocation might
happen to wake up kswapd, which will see DMA/DMA32 below the watermark
(with NORMAL as highest_zoneidx) and try to reclaim them back above the
watermarks. Since the LRU lists are per-node and not per-zone anymore, it
will spend a lot of effort finding pages from DMA/DMA32 to reclaim.

> In any case 96a5c186efff seems incorrect because it assumes that the
> protection has anything to do with how the higher zone is populated,
> while the protection fundamentally protects the lower zone from
> higher-zone allocations. Those allocations are independent of the actual
> memory in that zone.
>
>> The original commit was introduced as a clarification of the
>> /proc/zoneinfo output, so it doesn't seem there are usecases depending
>> on it, making the revert a simple solution.
>>
>> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
>> Cc: Mel Gorman <mgorman@xxxxxxx>
>> Cc: Vlastimil Babka <vbabka@xxxxxxx>
>> Cc: Baoquan He <bhe@xxxxxxxxxx>
>> Fixes: 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone")
>> Signed-off-by: Gabriel Krisman Bertazi <krisman@xxxxxxx>
>
> Acked-by: Michal Hocko <mhocko@xxxxxxxx>
> Thanks!
>
>> ---
>>  mm/page_alloc.c | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 579789600a3c..fe986e6de7a0 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5849,11 +5849,10 @@ static void setup_per_zone_lowmem_reserve(void)
>>
>>  		for (j = i + 1; j < MAX_NR_ZONES; j++) {
>>  			struct zone *upper_zone = &pgdat->node_zones[j];
>> -			bool empty = !zone_managed_pages(upper_zone);
>>
>>  			managed_pages += zone_managed_pages(upper_zone);
>>
>> -			if (clear || empty)
>> +			if (clear)
>>  				zone->lowmem_reserve[j] = 0;
>>  			else
>>  				zone->lowmem_reserve[j] = managed_pages / ratio;
>> --
>> 2.47.0
>