On 2/26/25 7:54 AM, Michal Hocko wrote:
> On Tue 25-02-25 22:22:58, Gabriel Krisman Bertazi wrote:
>> Commit 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's
>> ->lowmem_reserve[] for empty zone") removes the protection of lower
>> zones from allocations targeting memory-less high zones. This had an
>> unintended impact on the pattern of reclaims because it makes the
>> high-zone-targeted allocation more likely to succeed in lower zones,
>> which adds pressure to said zones. I.e., the following corresponding
>> checks in zone_watermark_ok/zone_watermark_fast are less likely to
>> trigger:
>>
>>     if (free_pages <= min + z->lowmem_reserve[highest_zoneidx])
>>         return false;
>>
>> As a result, we are observing an increase in reclaim and kswapd scans,
>> due to the increased pressure. This was initially observed as increased
>> latency in filesystem operations when benchmarking with fio on a machine
>> with some memory-less zones, but it has since been associated with
>> increased contention in locks related to memory reclaim. By reverting
>> this patch, the original performance was recovered on that machine.
>
> I think it would be nice to show the memory layout on that machine (is
> there any movable or device zone)?
>
> Exact reclaim patterns are really hard to predict, and it is a little
> bit surprising that the said patch has caused increased kswapd activity,
> because I would expect that there would be more reclaim with the lowmem
> reserves in place. But it is quite possible that the higher zone memory
> pressure is just tipping over and increasing the lowmem pressure enough
> that it shows up.

My theory is that the commit caused a difference between kernel and
userspace allocations, with bad consequences. A kernel allocation will
have highest_zoneidx = NORMAL and thus observe the lowmem_reserve for
ZONE_DMA32 unchanged.
A userspace allocation will have highest_zoneidx = MOVABLE and thus will
see a zero lowmem_reserve and will allocate from ZONE_DMA32 (or even
ZONE_DMA) when previously it wouldn't. Then a kernel allocation might
happen to wake up kswapd, which will see DMA/DMA32 below the watermark
(with NORMAL as highest_zoneidx) and try to reclaim them back above the
watermarks. Since the LRU lists are per-node and not per-zone anymore, it
will spend a lot of effort finding pages from DMA/DMA32 to reclaim.

> In any case 96a5c186efff seems incorrect because it assumes that the
> protection has anything to do with how the higher zone is populated,
> while the protection fundamentally protects the lower zone from
> higher-zone allocations. Those allocations are independent of the actual
> memory in that zone.
>
>> The original commit was introduced as a clarification of the
>> /proc/zoneinfo output, so it doesn't seem there are usecases depending
>> on it, making the revert a simple solution.
>>
>> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
>> Cc: Mel Gorman <mgorman@xxxxxxx>
>> Cc: Vlastimil Babka <vbabka@xxxxxxx>
>> Cc: Baoquan He <bhe@xxxxxxxxxx>
>> Fixes: 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone")
>> Signed-off-by: Gabriel Krisman Bertazi <krisman@xxxxxxx>
>
> Acked-by: Michal Hocko <mhocko@xxxxxxxx>
> Thanks!
>
>> ---
>>  mm/page_alloc.c | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 579789600a3c..fe986e6de7a0 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5849,11 +5849,10 @@ static void setup_per_zone_lowmem_reserve(void)
>>
>>  		for (j = i + 1; j < MAX_NR_ZONES; j++) {
>>  			struct zone *upper_zone = &pgdat->node_zones[j];
>> -			bool empty = !zone_managed_pages(upper_zone);
>>
>>  			managed_pages += zone_managed_pages(upper_zone);
>>
>> -			if (clear || empty)
>> +			if (clear)
>>  				zone->lowmem_reserve[j] = 0;
>>  			else
>>  				zone->lowmem_reserve[j] = managed_pages / ratio;
>> --
>> 2.47.0
>