On 8/25/21 00:08, Mike Kravetz wrote:
> Add Vlastimil and Hillf,
>
> Well, I set up a test environment on a larger system to get some
> numbers. My 'load' on the system was filling the page cache with
> clean pages. The thought is that these pages could easily be reclaimed.
>
> When trying to get numbers I hit a hugetlb page allocation stall where
> __alloc_pages(__GFP_RETRY_MAYFAIL, order 9) would stall forever (or at
> least an hour). It was very much like the symptoms addressed here:
> https://lore.kernel.org/linux-mm/20190806014744.15446-1-mike.kravetz@xxxxxxxxxx/
>
> This was on 5.14.0-rc6-next-20210820.
>
> I'll do some more digging as this appears to be some dark corner case of
> reclaim and/or compaction. The 'good news' is that I can reproduce
> this.

Interesting, let's see if that's some kind of new regression.

>> And the second problem would benefit from some words to help us
>> understand how much real-world hurt this causes, and how frequently.
>> And let's understand what the userspace workarounds look like, etc.
>
> The stall above was from doing a simple 'free 1GB page' followed by
> 'allocate 512 MB pages' from userspace.

Is the allocation different in any way than the usual hugepage
allocation possible today?

> Getting out another version of this series will be delayed, as I think
> we need to address or understand this issue first.