Re: [rfc] mm, hugetlb: allow hugepage allocations to excessively reclaim

Vlastimil Babka <vbabka@xxxxxxx> · Thu, 3 Oct 2019 10:14:58 +0200

On 10/3/19 1:03 AM, David Rientjes wrote:
> Hugetlb allocations use __GFP_RETRY_MAYFAIL to aggressively attempt to get 
> hugepages that the user needs.  Commit b39d0ee2632d ("mm, page_alloc: 
> avoid expensive reclaim when compaction may not succeed") intends to 
> improve allocator behavior for thp allocations to prevent excessive amounts
> of reclaim especially when constrained to a single node.
> 
> Since hugetlb allocations have explicitly preferred to loop and do reclaim 
> and compaction, exempt them from this new behavior at least for the time 
> being.  It is not shown that hugetlb allocation success rate has been 
> impacted by commit b39d0ee2632d but hugetlb allocations are admittedly 
> beyond the scope of what the patch is intended to address (thp 
> allocations).
> 
> Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> Signed-off-by: David Rientjes <rientjes@xxxxxxxxxx>
> ---
>  Mike, you alluded that you may want to opt hugetlbfs out of this for the
>  time being in https://marc.info/?l=linux-kernel&m=156771690024533 --

I think the key difference between Mike's tests and Michal's is this part
from Mike's mail linked above:

"I 'tested' by simply creating some background activity and then seeing
how many hugetlb pages could be allocated. Of course, many tries over
time in a loop."

- "some background activity" might be different than Michal's pre-filling
  of the memory with (clean) page cache
- "many tries over time in a loop" could mean that kswapd has time to 
  reclaim and eventually the new condition for pageblock order will pass
  every few retries, because there's enough memory for compaction and it
  won't return COMPACT_SKIPPED
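
To illustrate what "many tries over time in a loop" could look like, here is
a rough sketch. It is not Mike's actual test: the helper names, the retry
count and the delay are made up, and I'm assuming the pool is grown through
/proc/sys/vm/nr_hugepages (writing it needs root). The point is only that
each sleep gives kswapd time to reclaim before the next attempt.

```python
import time
from pathlib import Path

# Assumed location of the hugetlb pool size sysctl.
NR_HUGEPAGES = Path("/proc/sys/vm/nr_hugepages")


def read_count(path=NR_HUGEPAGES):
    """Return the current number of hugetlb pages in the pool."""
    return int(Path(path).read_text().split()[0])


def grow_pool(target, path=NR_HUGEPAGES, tries=10, delay=1.0):
    """Request `target` pages repeatedly (hypothetical helper).

    Returns the pool size observed after each attempt, so a caller can see
    whether later attempts succeed where earlier ones did not.
    """
    history = []
    for _ in range(tries):
        Path(path).write_text(f"{target}\n")
        history.append(read_count(path))
        if history[-1] >= target:
            break
        # Give background reclaim a chance before retrying.
        time.sleep(delay)
    return history
```

If the returned history keeps growing across attempts, the setup is testing
something different from a single allocation attempt against memory
pre-filled with page cache.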

>  not sure if you want to allow this excessive amount of reclaim for 
>  hugetlb allocations or not given the swap storms Andrea has shown are

More precisely, this is about hugetlb reservations by the admin, not
allocations by the program. It's when the admin uses the appropriate sysctl
to say how many hugetlb pages to reserve. In that case they expect that
memory will be reclaimed as needed. I don't think we should complicate the
admin action by requiring e.g. a sync+drop_caches before that, or retrying
in a loop. It's a one-time action, not a continuous swap storm by a stream of
THP allocations.
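
For reference, the one-time admin action amounts to something like the
sketch below. It assumes the sysctl in question is
/proc/sys/vm/nr_hugepages and that the outcome is checked through the
HugePages_* fields in /proc/meminfo; the function names are hypothetical
and the write needs root.

```python
from pathlib import Path


def parse_hugepage_stats(meminfo_text):
    """Extract the HugePages_* counters from /proc/meminfo-style text."""
    stats = {}
    for line in meminfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.startswith("HugePages_"):
            stats[key] = int(value.split()[0])
    return stats


def reserve_once(target,
                 sysctl="/proc/sys/vm/nr_hugepages",
                 meminfo="/proc/meminfo"):
    """Ask for `target` hugetlb pages once and return how many the pool has.

    No sync, drop_caches or retry loop: the single write is expected to
    reclaim as needed.
    """
    Path(sysctl).write_text(f"{target}\n")
    stats = parse_hugepage_stats(Path(meminfo).read_text())
    return stats.get("HugePages_Total", 0)
```

A return value below the requested target means the kernel could not
assemble all the pages in that single attempt.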

>  possible (and nr_hugepages_mempolicy does exist), but hugetlbfs was not
>  part of the problem we are trying to address here so no objection to
>  opting it out.  
> 
>  You might want to consider how expensive hugetlb allocations can become
>  and disruptive to the system if it does not yield additional hugepages,

Yes, there have been recent issues with the action not terminating properly
when there's nothing more to reclaim (i.e. the admin asking for an unrealistic
number of hugetlb pages), but that has been addressed (IIRC already merged
from mmotm to 5.4-rc1). It was actually an improvement to the reclaim/compaction
feedback that everybody asks for, although the result is obviously still
not perfect.