On 10/3/19 1:03 AM, David Rientjes wrote:
> Hugetlb allocations use __GFP_RETRY_MAYFAIL to aggressively attempt to get
> hugepages that the user needs. Commit b39d0ee2632d ("mm, page_alloc:
> avoid expensive reclaim when compaction may not succeed") intends to
> improve allocator behind for thp allocations to prevent excessive amounts
> of reclaim especially when constrained to a single node.
>
> Since hugetlb allocations have explicitly preferred to loop and do reclaim
> and compaction, exempt them from this new behavior at least for the time
> being. It is not shown that hugetlb allocation success rate has been
> impacted by commit b39d0ee2632d but hugetlb allocations are admittedly
> beyond the scope of what the patch is intended to address (thp
> allocations).
>
> Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> Signed-off-by: David Rientjes <rientjes@xxxxxxxxxx>
> ---
> Mike, you eluded that you may want to opt hugetlbfs out of this for the
> time being in https://marc.info/?l=linux-kernel&m=156771690024533 --

I think the key difference between Mike's tests and Michal's is this part
from Mike's mail linked above:

"I 'tested' by simply creating some background activity and then seeing
how many hugetlb pages could be allocated. Of course, many tries over
time in a loop."

- "some background activity" might be different from Michal's pre-filling
  of the memory with (clean) page cache
- "many tries over time in a loop" could mean that kswapd has time to
  reclaim, and eventually the new condition for pageblock order will pass
  every few retries, because there's enough memory for compaction and it
  won't return COMPACT_SKIPPED

> not sure if you want to allow this excessive amount of reclaim for
> hugetlb allocations or not given the swap storms Andrea has shown is

More precisely, this is about hugetlb reservations by the admin, not
allocations by the program. It's when the admin uses the appropriate
sysctl to say how many hugetlb pages to reserve.
In that case they expect that memory will be reclaimed as needed. I don't
think we should complicate the admin action by requiring e.g. a
sync+drop_caches before that, or retrying in a loop. It's a one-time
action, not a continuous swap storm caused by a stream of THP allocations.

> possible (and nr_hugepages_mempolicy does exist), but hugetlbfs was not
> part of the problem we are trying to address here so no objection to
> opting it out.
>
> You might want to consider how expensive hugetlb allocations can become
> and disruptive to the system if it does not yield additional hugepages,

Yes, there have been recent issues with the action not terminating
properly when there's nothing more to reclaim (i.e. the admin asking for
an unrealistic number of hugetlb pages), but that has been addressed (IIRC
already merged from mmotm to 5.4-rc1). It was actually an improvement to
the reclaim/compaction feedback that everybody asks for, although the
result is obviously still not perfect.
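
To make the condition being discussed concrete, here is a rough toy model of
the decision, written as a userspace Python sketch. It is not the kernel
code: the function name should_fail_early, the CompactResult enum, the flag
value and PAGEBLOCK_ORDER = 9 are all assumptions made for illustration.
Only the ideas come from the thread: allocations at pageblock order or above
give up early when compaction is skipped, and allocations carrying
__GFP_RETRY_MAYFAIL (hugetlb) are exempted.

```python
# Toy model of the allocator decision discussed in this thread.
# NOT kernel code: all names and values here are illustrative assumptions.
from enum import Enum, auto

GFP_RETRY_MAYFAIL = 0x1   # stand-in for __GFP_RETRY_MAYFAIL; value is made up
PAGEBLOCK_ORDER = 9       # assumed value, chosen only for the example


class CompactResult(Enum):
    SKIPPED = auto()      # stand-in for COMPACT_SKIPPED
    OTHER = auto()        # any other compaction outcome


def should_fail_early(order, gfp_flags, compact_result):
    """Return True if the allocation gives up without expensive reclaim."""
    if order < PAGEBLOCK_ORDER:
        # The new condition only applies at pageblock order and above.
        return False
    if gfp_flags & GFP_RETRY_MAYFAIL:
        # Proposed exemption: hugetlb keeps looping through reclaim/compaction.
        return False
    # Compaction was skipped, so reclaim is assumed unlikely to help.
    return compact_result is CompactResult.SKIPPED


# THP-like allocation, compaction skipped: gives up early.
print(should_fail_early(9, 0, CompactResult.SKIPPED))
# hugetlb-like allocation with the retry flag: exempt, keeps trying.
print(should_fail_early(9, GFP_RETRY_MAYFAIL, CompactResult.SKIPPED))
```

In this model the "many tries over time in a loop" case corresponds to
compact_result eventually becoming something other than SKIPPED once kswapd
has freed enough memory, at which point the early exit no longer triggers.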