Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality

Mike Kravetz <mike.kravetz@xxxxxxxxxx> · Fri, 27 Aug 2021 16:04:47 -0700

On 8/27/21 10:22 AM, Vlastimil Babka wrote:
> On 8/25/21 00:08, Mike Kravetz wrote:
>> Add Vlastimil and Hillf,
>>
>> Well, I set up a test environment on a larger system to get some
>> numbers.  My 'load' on the system was filling the page cache with
>> clean pages.  The thought is that these pages could easily be reclaimed.
>>
>> When trying to get numbers I hit a hugetlb page allocation stall where
>> __alloc_pages(__GFP_RETRY_MAYFAIL, order 9) would stall forever (or at
>> least an hour).  It was very much like the symptoms addressed here:
>> https://lore.kernel.org/linux-mm/20190806014744.15446-1-mike.kravetz@xxxxxxxxxx/
>>
>> This was on 5.14.0-rc6-next-20210820.
>>
>> I'll do some more digging as this appears to be some dark corner case of
>> reclaim and/or compaction.  The 'good news' is that I can reproduce
>> this.
> 
> Interesting, let's see if that's some kind of new regression.
> 
>>> And the second problem would benefit from some words to help us
>>> understand how much real-world hurt this causes, and how frequently.
>>> And let's understand what the userspace workarounds look like, etc.
>>
>> The stall above was from doing a simple 'free 1GB page' followed by
>> 'allocate 512 2MB pages' from userspace.
> 
> Is the allocation different in any way than the usual hugepage allocation
> possible today?

No, it is the same.  I was just following the demote use case: free a
1GB page and allocate 512 2MB pages.  Of course, you would mostly expect
that to succeed.  The exception is if something else is running and
grabs some of that 1GB of contiguous memory, such that you cannot
allocate 512 2MB pages.  My test case was to have those freed pages be
used for file I/O but kept clean, so that they could easily be
reclaimed.
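
For reference, a minimal sketch of that userspace sequence, assuming the
standard per-size sysfs pool files (hugepages-1048576kB and
hugepages-2048kB under /sys/kernel/mm/hugepages) on x86_64.  The helper
names are made up for illustration and this is not the exact script used
for the test.  Writing the real files requires root, and the second
write is the one that can block in the allocator.

```python
import os

SYSFS_HUGEPAGES = "/sys/kernel/mm/hugepages"  # standard sysfs location


def nr_hugepages_path(size_kb, root=SYSFS_HUGEPAGES):
    """Path of the pool-size file for one huge page size (in kB)."""
    return os.path.join(root, f"hugepages-{size_kb}kB", "nr_hugepages")


def pages_per_demote(src_kb=1048576, dst_kb=2048):
    """How many smaller pages cover one larger page (1GB / 2MB = 512)."""
    return src_kb // dst_kb


def read_count(path):
    """Read the current pool size from a nr_hugepages file."""
    with open(path) as f:
        return int(f.read())


def adjust_pool(size_kb, delta, root=SYSFS_HUGEPAGES):
    """Ask for the pool to change by delta pages.

    Returns (requested, actual).  On a real system the kernel may give
    fewer pages than requested, or take a long time trying.
    """
    path = nr_hugepages_path(size_kb, root)
    want = read_count(path) + delta
    with open(path, "w") as f:
        f.write(str(want))
    return want, read_count(path)


def manual_demote(root=SYSFS_HUGEPAGES):
    """Free one 1GB page, then request 512 more 2MB pages."""
    adjust_pool(1048576, -1, root)
    return adjust_pool(2048, pages_per_demote(), root)
```

If the returned 'actual' value is lower than 'requested', something else
claimed part of the freed 1GB range before the 2MB pages could be
allocated.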

I 'may' have been overstressing the system, with all CPUs doing file
reads to fill the page cache with clean pages.  I certainly need to
spend some more debug/analysis time on this.
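
The load generator itself is not shown in this message.  One
hypothetical way to produce a similar load is to read a set of existing
files with one worker per CPU, which leaves clean pages in the page
cache.  The function names and the choice of a thread pool are
assumptions for illustration only.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_file(path, chunk=1 << 20):
    """Read a file to the end and return the number of bytes read.

    Reading (without writing) leaves clean pages in the page cache.
    """
    total = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk)
            if not buf:
                break
            total += len(buf)
    return total


def fill_page_cache(paths, workers=None):
    """Read all paths in parallel, one worker per CPU by default."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(read_file, paths))
```

The files need to be larger in total than the memory freed by the 1GB
page for the freed range to end up holding page cache pages.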
-- 
Mike Kravetz