Re: [PATCH 2/2] Revert "mm, thp: restore node-local hugepage allocations"

Andrea Arcangeli <aarcange@xxxxxxxxxx> · Fri, 24 May 2019 16:29:31 -0400

Hello everyone,

On Thu, May 23, 2019 at 05:57:37PM -0700, Andrew Morton wrote:
> On Mon, 20 May 2019 10:54:16 -0700 (PDT) David Rientjes <rientjes@xxxxxxxxxx> wrote:
> 
> > We are going in circles, *yes* there is a problem for potential swap 
> > storms today because of the poor interaction between memory compaction and 
> > directed reclaim but this is a result of a poor API that does not allow 
> > userspace to specify that its workload really will span multiple sockets 
> > so faulting remotely is the best course of action.  The fix is not to 
> > cause regressions for others who have implemented a userspace stack that 
> > is based on the past 3+ years of long standing behavior or for specialized 
> > workloads where it is known that it spans multiple sockets so we want some 
> > kind of different behavior.  We need to provide a clear and stable API to 
> > define these terms for the page allocator that is independent of any 
> > global setting of thp enabled, defrag, zone_reclaim_mode, etc.  It's 
> > workload dependent.
> 
> um, who is going to do this work?

That's a good question. It's going to be a not simple patch to
backport to -stable: it'll be intrusive and it will affect
mm/page_alloc.c significantly so it'll reject heavy. I wouldn't
consider it -stable material at least in the short term, it will
require some testing.

This is why applying a simple fix that avoids the swap storms (and the
swap-less pathological THP regression for vfio device assignment GUP
pinning) is preferable before adding an alloc_pages_multi_order (or
equivalent) so that it'll be the allocator that will decide when
exactly to fallback from 2M to 4k depending on the NUMA distance and
memory availability during the zonelist walk. The basic idea is to
call alloc_pages just once (not first for 2M and then for 4k) and
alloc_pages will decide which page "order" to return.

> Implementing a new API doesn't help existing userspace which is hurting
> from the problem which this patch addresses.

Yes, we can't change all apps that may not fit in a single NUMA
node. Currently it's unsafe to turn "transparent_hugepages/defrag =
always" or the bad behavior can then materialize also outside of
MADV_HUGEPAGE. Those apps that use MADV_HUGEPAGE on their long lived
allocations (i.e. guest physical memory) like qemu are affected even
with the default "defrag = madvise". Those apps are using
MADV_HUGEPAGE for more than 3 years and they are widely used and open
source of course.

> It does appear to me that this patch does more good than harm for the
> totality of kernel users, so I'm inclined to push it through and to try
> to talk Linus out of reverting it again.  

That sounds great. It's also what 3 enterprise distributions had to do
already.

As Mel described in detail, remote THP can't be slower than the swap
I/O (even if we'd swap on a nvdimm it wouldn't change this).

As Michael suggested a dynamic "numa_node_id()" mbind could be pursued
orthogonally to still be able to retain the current upstream behavior
for small apps that can fit in the node and do extremely long lived
static allocations and that don't care if they cause a swap storm
during startup. All we argue about is the default "defrag = always"
and MADV_HUGEPAGE behavior.

The current behavior of "defrag = always" and MADV_HUGEPAGE is way
more aggressive than zone_reclaim_mode in fact, which is also not
enabled by default for similar reasons (but enabling zone_reclaim_mode
by default would cause much less risk of pathological regressions to
large workloads that can't fit in a single node). Enabling
zone_reclaim_mode would eventually fallback to remote nodes
gracefully. As opposed the fallback to remote nodes with
__GFP_THISNODE can only happen after the 2M allocation has failed and
the problem is that 2M allocation don't fail because
compaction+reclaim interleaving keeps succeeding by swapping out more
and more memory, which would the perfectly right behavior for
compaction+reclaim interleaving if only the whole system would be out
of memory in all nodes (and it isn't).

The false positive result from the automated testing (where swapping
overall performance decreased because fariness increased) wasn't
anybody's fault and so the revert at the end of the merge window was a
safe approach. So we can try again to fix it now.

Thanks!
Andrea