On 7/4/19 4:09 AM, Michal Hocko wrote:
> On Wed 03-07-19 16:54:35, Mike Kravetz wrote:
>> On 7/3/19 2:43 AM, Mel Gorman wrote:
>>> Indeed. I'm getting knocked offline shortly so I didn't give this the
>>> time it deserves, but it appears that part of this problem is
>>> hugetlb-specific when one node is full and can enter into this continual
>>> loop due to __GFP_RETRY_MAYFAIL requiring both nr_reclaimed and
>>> nr_scanned to be zero.
>>
>> Yes, I am not aware of any other large order allocations consistently made
>> with __GFP_RETRY_MAYFAIL. But, I did not look too closely. Michal believes
>> that hugetlb page allocations should use __GFP_RETRY_MAYFAIL.
>
> Yes. The argument is that this is controllable by an admin and failures
> should be prevented as much as possible. I didn't get to understand the
> should_continue_reclaim part of the problem, but I have a strong feeling
> that the __GFP_RETRY_MAYFAIL handling at that layer is not correct. What
> happens if it is simply removed and we rely only on the retry mechanism
> in the page allocator instead? Is the success rate reduced considerably?

It certainly will be reduced. I 'think' it will be hard to predict how much,
as that will depend on the state of memory usage and fragmentation at the
time of the attempt. I can try to measure this, but it will take a few days
due to the U.S. holiday.
--
Mike Kravetz
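
For anyone following along, a minimal sketch of the decision being discussed,
assuming the __GFP_RETRY_MAYFAIL branch of should_continue_reclaim() behaves
as Mel describes above. This is a simplified userspace model, not the actual
mm/vmscan.c code; names and the surrounding logic are reduced to the one
check in question:

/* Simplified model of the bail-out check under discussion. */
#include <stdbool.h>
#include <stdio.h>

#define MODEL_GFP_RETRY_MAYFAIL	0x1u	/* stand-in for the real gfp flag */

static bool model_should_continue_reclaim(unsigned int gfp_mask,
					  unsigned long nr_reclaimed,
					  unsigned long nr_scanned)
{
	if (gfp_mask & MODEL_GFP_RETRY_MAYFAIL) {
		/*
		 * __GFP_RETRY_MAYFAIL case: only give up once a pass
		 * reclaims nothing AND scans nothing.  A full node that
		 * still has scannable but unreclaimable pages keeps
		 * reporting nr_scanned > 0, so reclaim never bails here,
		 * which is the continual loop described above.
		 */
		if (!nr_reclaimed && !nr_scanned)
			return false;
	} else {
		/* Other allocations give up as soon as nothing is reclaimed. */
		if (!nr_reclaimed)
			return false;
	}
	return true;
}

int main(void)
{
	/* Full node: nothing reclaimed, but pages are still being scanned. */
	printf("RETRY_MAYFAIL, scanned>0: continue=%d\n",
	       model_should_continue_reclaim(MODEL_GFP_RETRY_MAYFAIL, 0, 512));
	/* The same state without the flag bails out immediately. */
	printf("no flag,       scanned>0: continue=%d\n",
	       model_should_continue_reclaim(0, 0, 512));
	return 0;
}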