Re: [PATCH 0/2] fix for "pathological THP behavior"

Andrea Arcangeli <aarcange@xxxxxxxxxx> · Tue, 21 Aug 2018 18:18:43 -0400

On Tue, Aug 21, 2018 at 10:26:54AM -0700, David Rientjes wrote:
> MADV_HUGEPAGE (or defrag == "always") would now become a combination of
> "try to compact locally" and "allocate remotely if necessary" without the
> ability to avoid the latter absent a mempolicy that affects all memory 

I don't follow why compaction should run only on the local node in
that case (i.e. with __GFP_THISNODE removed when __GFP_DIRECT_RECLAIM
is set).

The zonelist will simply span all nodes, so compaction & reclaim should
both run on all nodes for MADV_HUGEPAGE with option 2).
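
To make the zonelist point concrete, here is a rough userspace sketch
of option 2). It is a toy model, not kernel code: the TOY_* flag
values and both helper names are made up for illustration, and the
real zonelist walk is obviously more involved than a node count.

```c
/* Toy stand-ins for the kernel gfp bits; the values are arbitrary. */
#define TOY_GFP_THISNODE        0x1u
#define TOY_GFP_DIRECT_RECLAIM  0x2u

/*
 * Hypothetical helper modelling option 2): if the allocation is
 * allowed to enter direct reclaim/compaction, drop the local-node
 * restriction.
 */
static unsigned int toy_thp_gfp(unsigned int gfp)
{
	if (gfp & TOY_GFP_DIRECT_RECLAIM)
		gfp &= ~TOY_GFP_THISNODE;
	return gfp;
}

/*
 * Hypothetical helper: how many nodes the allocation (and therefore
 * compaction and reclaim) may look at. With the THISNODE bit only
 * the local node is eligible, without it all nodes are.
 */
static int toy_nodes_visited(unsigned int gfp, int nr_nodes)
{
	return (gfp & TOY_GFP_THISNODE) ? 1 : nr_nodes;
}
```

In this model, a request carrying both bits on a 4 node system ends
up eligible on all 4 nodes, while a request without the reclaim bit
stays on the local node only.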

The only mess there is in the allocator right now is that compaction
runs per zone and reclaim runs per node but that's another issue and
won't hurt for this case.

> allocations.  I think the complete solution would be a MPOL_F_HUGEPAGE 
> flag that defines mempolicies for hugepage allocations.  In my experience 
> thp falling back to remote nodes for intrasocket latency is a win but 
> intersocket or two-hop intersocket latency is a no go.

Yes, that's my expectation too.

So what you suggest is to add a new hard binding that allows altering
the default behavior for THP. That sure sounds fine.

We still have to pick the actual default and decide whether a single
default is ok, whether it should be tunable, or whether the default
should even change depending on the NUMA topology.

I suspect it's a bit overkill to have different defaults depending on
NUMA topology. There have been defaults for obscure things like
numa_zonelist_order that changed behavior depending on the number of
nodes, and they happened to hurt on some systems. I ended up tuning
them to the current default (until the runtime tuning was removed).

It's hard to pick the best default based only on arbitrary things
like the number of NUMA nodes or their distance, especially when what
is better also depends on the actual application.

I think the options are sane behaviors with some pros and cons:
option 2) is simpler and will likely perform better on smaller
systems, while option 1) is less risky on larger systems.

In any case, the watermark optimization (setting __GFP_THISNODE only
if there's plenty of PAGE_SIZEd memory in the local node) remains a
valid optimization for later, for the default "defrag" value (i.e. no
MADV_HUGEPAGE) that doesn't set __GFP_DIRECT_RECLAIM. If there's no
RAM free in the local node we can just as well try to pick the THP
from the other nodes; not doing so only has the benefit of saving the
watermark check itself.
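
Roughly, the watermark idea could look like the sketch below. Again
this is a userspace toy model under my own assumptions, not a patch:
the TOY_* values, the helper name and the way the free page count
and the watermark are passed in are all hypothetical.

```c
/* Toy stand-ins for the kernel gfp bits; the values are arbitrary. */
#define TOY_GFP_THISNODE        0x1u
#define TOY_GFP_DIRECT_RECLAIM  0x2u

/*
 * Hypothetical helper for the default "defrag" case (no
 * MADV_HUGEPAGE, so no direct reclaim): stick to the local node
 * only while it still has plenty of free PAGE_SIZEd memory,
 * otherwise let the THP come from any node.
 */
static unsigned int toy_default_thp_gfp(unsigned int gfp,
					unsigned long local_free_pages,
					unsigned long watermark)
{
	/* The direct reclaim case is not covered here: leave it alone. */
	if (gfp & TOY_GFP_DIRECT_RECLAIM)
		return gfp;

	if (local_free_pages > watermark)
		gfp |= TOY_GFP_THISNODE;	/* plenty of local memory */
	else
		gfp &= ~TOY_GFP_THISNODE;	/* local node is short */

	return gfp;
}
```

With a watermark of 512 pages in this model, a local node with 1000
free pages keeps the allocation local, and one with 100 free pages
lets it spill to the other nodes.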

Thanks,
Andrea