On Tue, Aug 21, 2018 at 10:26:54AM -0700, David Rientjes wrote:
> MADV_HUGEPAGE (or defrag == "always") would now become a combination of
> "try to compact locally" and "allocate remotely if necesary" without the
> ability to avoid the latter absent a mempolicy that affects all memory

I don't follow why compaction should run only on the local node in
that case (i.e. __GFP_THISNODE removed when __GFP_DIRECT_RECLAIM is
set). The zonelist will simply span all nodes, so compaction & reclaim
should both run on all nodes for MADV_HUGEPAGE with option 2). The
only mess in the allocator right now is that compaction runs per zone
and reclaim runs per node, but that's another issue and won't hurt for
this case.

> allocations. I think the complete solution would be a MPOL_F_HUGEPAGE
> flag that defines mempolicies for hugepage allocations. In my experience
> thp falling back to remote nodes for intrasocket latency is a win but
> intersocket or two-hop intersocket latency is a no go.

Yes, that's my expectation too. So what you suggest is to add a new
hard binding that allows altering the default behavior for THP; that
sure sounds fine.

We still have to pick the actual default and decide whether a single
default is ok, whether it should be tunable, or whether the default
should even change depending on the NUMA topology.

I suspect it's a bit overkill to have different defaults depending on
NUMA topology. There have been defaults for obscure things like
numa_zonelist_order that changed behavior depending on the number of
nodes, and they happened to hurt on some systems. I ended up tuning
them to the current default (until the runtime tuning was removed).

It's a bit hard to pick the best default based only on arbitrary
things like the number of NUMA nodes or their distance, especially
when what is better also depends on the actual application.

I think the options are sane behaviors with some pros and cons:
option 2) is simpler and will likely perform better on smaller
systems, option 1) is less risky in larger systems.

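To make the flag handling of option 2) concrete, here is a minimal
userspace sketch, not kernel code. The SKETCH_* bit values and the
sketch_thp_gfp_option2() helper are hypothetical names made up for
illustration; the real __GFP_* definitions and the actual THP
allocation path differ.

```c
/* Hypothetical bit values for illustration only; they are NOT the
 * kernel's real __GFP_DIRECT_RECLAIM / __GFP_THISNODE definitions. */
#define SKETCH_GFP_DIRECT_RECLAIM 0x1u
#define SKETCH_GFP_THISNODE       0x2u

/*
 * Option 2) as described above: if the allocation is allowed to enter
 * direct reclaim/compaction, drop the local-node restriction so the
 * zonelist spans all nodes. Otherwise leave the mask untouched.
 */
static unsigned int sketch_thp_gfp_option2(unsigned int gfp)
{
	if (gfp & SKETCH_GFP_DIRECT_RECLAIM)
		gfp &= ~SKETCH_GFP_THISNODE;
	return gfp;
}
```

With this sketch, a mask that has both bits set comes back with only
the direct-reclaim bit, while a mask without direct reclaim keeps its
local-node restriction.
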
In any case, the watermark optimization, i.e. setting __GFP_THISNODE
only if there's plenty of PAGE_SIZEd memory in the local node, remains
a valid optimization for later, for the default "defrag" value (i.e.
no MADV_HUGEPAGE) that doesn't set __GFP_DIRECT_RECLAIM. If there's no
RAM free in the local node we can totally try to pick the THP from the
other nodes, and not doing so only has the benefit of saving the
watermark check itself.

Thanks,
Andrea