Re: [PATCH 0/2] fix for "pathological THP behavior"

Vlastimil Babka <vbabka@xxxxxxx> · Tue, 21 Aug 2018 17:30:11 +0200

On 8/20/18 5:19 PM, Andrea Arcangeli wrote:
> Hi Kirill,
> 
> On Mon, Aug 20, 2018 at 02:58:18PM +0300, Kirill A. Shutemov wrote:
>> I personally prefer to prioritize NUMA locality over THP
>> (__GFP_COMPACT_ONLY variant), but I don't know page-alloc/compaction good
>> enough to Ack it.
> 
> If we go in this direction it'd be nice after fixing the showstopper
> bug, if we could then proceed with an orthogonal optimization by
> checking the watermarks and if the watermarks shows there are no
> PAGE_SIZEd pages available in the local node we should remove both
> __GFP_THISNODE and __GFP_COMPACT_ONLY.
> 
> If as opposed there's still PAGE_SIZEd free memory in the local node
> (not possible to compact for whatever reason), we should stick to
> __GFP_THISNODE | __GFP_COMPACT_ONLY.

If it's "not possible to compact" then the expected outcome of this is
to fail?

> It's orthogonal because the above addition would make sense also in
> the current (buggy) code.
> 
> The main implementation issue is that the watermark checking is not
> well done in mempolicy.c but the place to clear __GFP_THISNODE and
> __GFP_COMPACT_ONLY currently is there.

You could do that without calling watermark checking explicitly, but
it's rather complicated:

1. try alocating with __GFP_THISNODE and ~GFP_DIRECT_RECLAIM
2. if that fails, try PAGE_SIZE with same flags
3. if that fails, try THP size without __GFP_THISNODE
4. PAGE_SIZE without __GFP_THISNODE

Yeah, not possible in current alloc_pages_vma() which should return the
requested order. But the advantage is that it's not prone to races
between watermark checking and actual attempt.

> The case that the local node gets completely full and has not even 4k
> pages available should be totally common, because if you keep
> allocating and you allocate more than the size of a NUMA node
> eventually you will fill the local node with THP then consume all 4k
> pages and then you get into the case where the current code is totally
> unable to allocate THP from the other nodes and it would be totally
> possible to fix with the removal of __GFP_THISNODE |
> __GFP_COMPACT_ONLY, after the PAGE_SIZE watermark check.
> 
> I'm mentioning this optimization in this context, even if it's
> orthogonal, because the alternative patch that prioritizes THP over
> NUMA locality for MADV_HUGEPAGE and defer=always would solve that too
> with a one liner and there would be no need of watermark checking and
> flipping gfp bits whatsoever. Once the local node is full, THPs keeps
> being provided as expected.

Frankly, I would rather go with this option and assume that if someone
explicitly wants THP's, they don't care about NUMA locality that much.
(Note: I hate __GFP_THISNODE, it's an endless source of issues.)
Trying to be clever about "is there still PAGE_SIZEd free memory in the
local node" is imperfect anyway. If there isn't, is it because there's
clean page cache that we can easily reclaim (so it would be worth
staying local) or is it really exhausted? Watermark check won't tell...

Vlastimil

> Thanks,
> Andrea
>