[LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under MADV_HUGEPAGE

Andrea Arcangeli <aarcange@xxxxxxxxxx> · Tue, 29 Jan 2019 18:40:58 -0500

Hello,

I'd like to attend the LSF/MM Summit 2019. I'm interested in most MM
topics and it's enlightening to listen to the common non-MM topics
too.

One current topic that could be of interest is the THP / NUMA tradeoff
in the subject.

A change in MADV_HUGEPAGE behavior made ~3 years ago has been under
discussion for the last 6 months (~12 months since it was initially
reported as a regression through an enterprise-like workload). It was
hot-fixed in commit ac5b2c18911ffe95c08d69273917f90212cf5659, but that
fix was quickly reverted for various reasons.

I posted some benchmark results showing that for tasks without strong
NUMA locality the __GFP_THISNODE logic is not guaranteed to be optimal
(and here of course I mean even if we ignore the large slowdown with
swap storms at allocation time that might be caused by
__GFP_THISNODE). The results also show NUMA remote THPs help
intrasocket as well as intersocket.

https://lkml.kernel.org/r/20181210044916.GC24097@xxxxxxxxxx
https://lkml.kernel.org/r/20181212104418.GE1130@xxxxxxxxxx

The following seems to be the interim conclusion, on which I happen to
be in agreement with Michal and Mel:

https://lkml.kernel.org/r/20181212095051.GO1286@xxxxxxxxxxxxxx
https://lkml.kernel.org/r/20181212170016.GG1130@xxxxxxxxxx

Hopefully this specific issue will be hot-fixed before April (like we
had to hot-fix it in the enterprise kernels to keep the 3 year old
regression from breaking large workloads that can't fit in a single
NUMA node, and I assume other enterprise distributions will follow
suit), but whatever the hot-fix is, it will likely leave ample margin
for discussion on what we can do better to optimize the decision
between local non-THP and remote THP under MADV_HUGEPAGE.

It is clear that the __GFP_THISNODE forced in the current code
provides some minor advantage to apps using MADV_HUGEPAGE that can fit
in a single NUMA node, but we should try to achieve it without major
disadvantages to apps that can't fit in a single NUMA node.
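
For reference, this is roughly what "apps using MADV_HUGEPAGE" do from
userland. It is only a minimal sketch: the helper name
map_with_thp_hint() is made up for illustration, and whether the pages
end up local, remote, huge or 4k is decided entirely by the kernel
policy under discussion, not by anything in this code.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map `len` bytes of anonymous memory and ask the
 * kernel to back it with transparent huge pages where it can.
 * Returns NULL if mmap() fails. The madvise() result is reported
 * through `madvise_rc` (if non-NULL) because the hint is only advisory
 * and may be rejected, e.g. on a kernel built without THP.
 */
void *map_with_thp_hint(size_t len, int *madvise_rc)
{
    assert(len > 0);

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

#ifdef MADV_HUGEPAGE
    int rc = madvise(p, len, MADV_HUGEPAGE);
#else
    int rc = -1; /* hint not available with these headers */
#endif
    if (madvise_rc)
        *madvise_rc = rc;

    return p;
}
```

The region is usable whether or not the hint was accepted, so a caller
would normally treat a madvise() failure as non-fatal.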

For example it was mentioned that we could allocate readily available
already-free local 4k pages if local compaction fails and the watermarks
still allow local 4k allocations without invoking reclaim, before
invoking compaction on remote nodes. The same can be repeated at a
second level with intra-socket non-THP memory before invoking
compaction inter-socket. However we can't do things like that with the
current page allocator workflow. It's possible some larger change is
required than just sending a single gfp bitflag down to the page
allocator that creates an implicit MPOL_LOCAL binding to make it
behave like the obsoleted numa/zone reclaim behavior, but weirdly only
applied to THP allocations.
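
One possible reading of the fallback order mentioned above, written as
a small userland model rather than as page allocator code. All names
here (tier, tier_state, pick_allocation) are invented for illustration
and nothing in it corresponds to an existing kernel interface. What
happens after inter-socket compaction fails is not covered above, so
the sketch just reports PAGE_NONE and leaves that case to whatever the
regular slow path would do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Distance tiers, nearest first (hypothetical names). */
enum tier { TIER_LOCAL, TIER_INTRA_SOCKET, TIER_INTER_SOCKET, TIER_COUNT };

/* What a simulated tier could provide right now. */
struct tier_state {
    bool thp_after_compaction; /* compaction would yield a huge page */
    bool free_4k_above_wmark;  /* 4k pages are free without reclaim */
};

enum page_kind { PAGE_NONE, PAGE_THP, PAGE_4K };

struct decision {
    enum page_kind kind;
    enum tier tier; /* meaningful only when kind != PAGE_NONE */
};

/*
 * Walk the tiers from nearest to farthest. At each tier try a THP via
 * compaction first; if that fails, take already-free 4k pages from the
 * same tier before compacting a more distant one. The farthest tier
 * has no "more distant" compaction to avoid, so no 4k step is modeled
 * there.
 */
struct decision pick_allocation(const struct tier_state tiers[TIER_COUNT])
{
    assert(tiers != NULL);

    for (int t = 0; t < TIER_COUNT; t++) {
        if (tiers[t].thp_after_compaction)
            return (struct decision){ PAGE_THP, (enum tier)t };
        if (t != TIER_INTER_SOCKET && tiers[t].free_4k_above_wmark)
            return (struct decision){ PAGE_4K, (enum tier)t };
    }
    return (struct decision){ PAGE_NONE, TIER_LOCAL };
}
```

With local compaction failing but local 4k pages free above the
watermark, the model picks local 4k even if a remote THP would be
available, which is exactly the kind of decision a single gfp bitflag
can't express today.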

--

In addition to the above "NUMA remote THP vs NUMA local non-THP
tradeoff" topic, there are other developments in "userfaultfd" land
that are approaching merge readiness and that I could give a short
overview of:

- Peter Xu made significant progress in finalizing the userfaultfd-WP
  support over the last few months. That feature was planned from the
  start and it will allow userland to do some new things that weren't
  possible to achieve before. In addition to synchronously blocking
  write faults so they can be resolved by a userland manager, it also
  has the ability to obsolete the softdirty feature, because it can
  provide the same information, but with O(1) complexity (as opposed
  to the current softdirty O(N) complexity), similarly to what Page
  Modification Logging (PML) does in hardware for EPT write accesses.

- Blake Caldwell maintained the UFFDIO_REMAP support to atomically
  remove memory from a mapping with userfaultfd (which can't be done
  with a copy as in UFFDIO_COPY and it requires a slow TLB flush to be
  safe) as an alternative to host swapping (which of course also
  requires a TLB flush for similar reasons). Notably UFFDIO_REMAP was
  rightfully NAKed early on and quickly replaced by UFFDIO_COPY, which
  is better suited to adding memory to a mapping in small chunks, but
  we can't remove memory with UFFDIO_COPY, and UFFDIO_REMAP should be
  as efficient as it gets when it comes to removing memory from a
  mapping.

Thank you,
Andrea