Re: [PATCH 2/2] mm: thp: fix transparent_hugepage/defrag = madvise || always

Andrea Arcangeli <aarcange@xxxxxxxxxx> · Tue, 21 Aug 2018 17:40:49 -0400

On Tue, Aug 21, 2018 at 01:50:57PM +0200, Michal Hocko wrote:
> So does reverting 5265047ac301 ("mm, thp: really limit transparent
> hugepage allocation to local node") help?

That won't revert clean, to be sure you'd need to bisect around it
which I haven't tried because I don't think there was doubt around it
(and only the part in mempolicy.c is relevant at it).

It's not so important to focus on that commit the code changed again
later, I'm focusing on the code as a whole.

It's not just that commit that is the problem here, the problem starts
in the previous commits that adds the NUMA locality logic, the
addition of __GFP_THISNODE is the icing on the cake that makes things
fall apart.

> I really detest a new gfp flag for one time semantic that is muddy as
> hell.

Well there's no way to fix this other than to prevent reclaim to run,
if you still want to give a chance to page faults to obtain THP under
MADV_HUGEPAGE in the page fault without waiting minutes or hours for
khugpaged to catch up with it.

> This is simply incomprehensible. How can anybody who is not deeply
> familiar with the allocator/reclaim internals know when to use it.

Nobody should use this in drivers, it's a __GFP flag.

Note:

	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {

#define alloc_hugepage_vma(gfp_mask, vma, addr, order)	\
	alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)

Only THP is ever affected by the BUG so nothing else will ever need to
call __GFP_COMPACT_ONLY. It is a VM internal flag, I wish there was a
way to make the build fail if a driver would use it but there isn't
right now.

All other order 9 or 10 allocations from drivers won't call
alloc_hugepage_vma. Only mm/huge_memory.c ever calls that which is why
the regression only happens with MADV_HUGEPAGE (i.e. qemu is being
bitten badly on NUMA hosts).

> If this is really a regression then we should start by pinpointing the

You can check yourself, create a 2 node vnuma guest or pick any host
with more than one node. Set defrag=always and run "memhog -r11111111
18g" if host has 16g per node. Add some swap and notice the swap storm
while all ram is left free in the other node.

> real culprit and go from there. If this is really 5265047ac301 then just

In my view there's no single culprit, but it was easy to identify the
last drop that made the MADV_HUGEPAGE glass overflow, and it's that
commit that adds __GFP_THISNODE. The combination of the previous code
that prioritized NUMA over THP and then the MADV_HUGEPAGE logic that
still uses compaction (and in turn reclaim if compaction fails with
COMPACT_SKIPPED because there's no 4k page in the local node) just
falls apart with __GFP_THISNODE set as well on top of it and it
doesn't do the expected thing either without it (i.e. THP gets
priority over NUMA locality without such flag).

__GFP_THISNODE and the logic there, only works ok when
__GFP_DIRECT_RECLAIM is not set, i.e. MADV_HUGEPAGE not set.

We don't want to wait hours for khugepaged to catch up in qemu to get
THP. Compaction is certainly worth it to run if the userland
explicitly gives the hint to the kernel the allocations are long
lived.

> start by reverting it. I strongly suspect there is some mismatch in
> expectations here. What others consider acceptable seems to be a problem
> for others. I believe that was one of the reasons why we have changed
> the default THP direct compaction behavior, no?

This is not about the "default" though, I'm not changing the default
either, this is about MADV_HUGEPAGE behavior and when you change from
the defrag "default" value to "always" (which is equivalent than
having have all vmas set with MADV_HUGEPAGE).

QEMU is being optimal in setting MADV_HUGEPAGE, and rightfully so, but
it's getting punished badly because of this kernel bug.

Thanks,
Andrea