Hello,

I'd like to attend the LSF/MM Summit 2019. I'm interested in most MM topics and it's enlightening to listen to the common non-MM topics too.

One current topic that could be of interest is the THP / NUMA tradeoff in subject.

One issue about a change in MADV_HUGEPAGE behavior made ~3 years ago has kept floating around for the last 6 months (~12 months since it was initially reported as a regression through an enterprise-like workload). It was hot-fixed in commit ac5b2c18911ffe95c08d69273917f90212cf5659, but the fix was quickly reverted for various reasons.

I posted some benchmark results showing that for tasks without strong NUMA locality the __GFP_THISNODE logic is not guaranteed to be optimal (and here of course I mean even if we ignore the large slowdown with swap storms at allocation time that might be caused by __GFP_THISNODE). The results also show that NUMA remote THPs help intrasocket as well as intersocket.

https://lkml.kernel.org/r/20181210044916.GC24097@xxxxxxxxxx
https://lkml.kernel.org/r/20181212104418.GE1130@xxxxxxxxxx

The following seems to be the interim conclusion, on which I happen to be in agreement with Michal and Mel:

https://lkml.kernel.org/r/20181212095051.GO1286@xxxxxxxxxxxxxx
https://lkml.kernel.org/r/20181212170016.GG1130@xxxxxxxxxx

Hopefully this strict issue will be hot-fixed before April (like we had to hot-fix it in the enterprise kernels to keep the 3-year-old regression from breaking large workloads that can't fit in a single NUMA node, and I assume other enterprise distributions will follow suit). Whatever the hot-fix turns out to be, it will likely leave ample margin for discussion on what we can do better to optimize the decision between local non-THP and remote THP under MADV_HUGEPAGE.

It is clear that the __GFP_THISNODE forced in the current code provides some minor advantage to apps using MADV_HUGEPAGE that can fit in a single NUMA node, but we should try to achieve it without major disadvantages to apps that can't fit in a single NUMA node.
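
For readers less familiar with the userland side of this, here is a minimal sketch of how an app opts a range into THP with MADV_HUGEPAGE. The allocation decision discussed above (local vs remote, THP vs non-THP) is then taken by the kernel when the range is first touched. The struct and function names are hypothetical, and THP_SIZE is assumed to be 2 MiB, which is the x86-64 value and differs on other architectures.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Assumption: 2 MiB THP size (x86-64); other architectures differ. */
#define THP_SIZE ((size_t)2 << 20)

/* Hypothetical bookkeeping for one THP-hinted anonymous region. */
struct thp_region {
	void *map;          /* start of the raw mapping, for munmap() */
	size_t map_len;     /* length of the raw mapping */
	void *aligned;      /* THP_SIZE-aligned start of the usable area */
	size_t len;         /* usable length, a multiple of THP_SIZE */
	int madvise_errno;  /* 0 if MADV_HUGEPAGE was accepted, else errno */
};

/*
 * Map nr_huge * THP_SIZE bytes of anonymous memory and hint that the range
 * should be backed by THPs. Returns 0 on success, -1 if mmap() failed.
 * A rejected hint (for example THP not built in) is recorded in
 * madvise_errno and not treated as fatal.
 */
int thp_region_alloc(struct thp_region *r, size_t nr_huge)
{
	uintptr_t start;

	assert(nr_huge > 0);
	r->len = nr_huge * THP_SIZE;
	r->map_len = r->len + THP_SIZE;	/* slack so the start can be aligned */
	r->map = mmap(NULL, r->map_len, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (r->map == MAP_FAILED)
		return -1;

	/* Round up to the next THP_SIZE boundary inside the mapping. */
	start = ((uintptr_t)r->map + THP_SIZE - 1) & ~(uintptr_t)(THP_SIZE - 1);
	r->aligned = (void *)start;

	r->madvise_errno = 0;
	if (madvise(r->aligned, r->len, MADV_HUGEPAGE) != 0)
		r->madvise_errno = errno;
	return 0;
}

/* Write to the whole range so the pages are actually faulted in. */
void thp_region_touch(struct thp_region *r)
{
	memset(r->aligned, 0xa5, r->len);
}

void thp_region_free(struct thp_region *r)
{
	munmap(r->map, r->map_len);
}
```

The hint alone does not say where the memory should come from. Whether the fault gets a local THP, a remote THP or local 4k pages is entirely up to the kernel policy that this topic is about.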

For example it was mentioned that, if local compaction fails and the watermarks still allow local 4k allocations without invoking reclaim, we could allocate readily available already-free local 4k pages before invoking compaction on remote nodes. The same can be repeated at a second level with intra-socket non-THP memory before invoking compaction inter-socket. However we can't do things like that with the current page allocator workflow.

It's possible that some larger change is required than just sending a single gfp bitflag down to the page allocator, one that creates an implicit MPOL_LOCAL binding to make it behave like the obsoleted numa/zone reclaim behavior, but weirdly only applied to THP allocations.

--

In addition to the above "NUMA remote THP vs NUMA local non-THP tradeoff" topic, there are other developments in "userfaultfd" land that are approaching merge readiness and on which it would be possible to provide a short overview:

- Peter Xu made significant progress in finalizing the userfaultfd-WP support over the last few months. That feature was planned from the start and it will allow userland to do some new things that weren't possible to achieve before. In addition to synchronously blocking write faults so that they can be resolved by a userland manager, it can also obsolete the softdirty feature, because it can provide the same information but with O(1) complexity (as opposed to the current softdirty O(N) complexity), similarly to what Page Modification Logging (PML) does in hardware for EPT write accesses.

- Blake Caldwell maintained the UFFDIO_REMAP support to atomically remove memory from a mapping with userfaultfd (which can't be done with a copy as in UFFDIO_COPY, and which requires a slow TLB flush to be safe) as an alternative to host swapping (which of course also requires a TLB flush for similar reasons).

  Notably UFFDIO_REMAP was rightfully nacked early on and quickly replaced by UFFDIO_COPY, which is better suited to adding memory to a mapping in small chunks. But we can't remove memory with UFFDIO_COPY, and UFFDIO_REMAP should be as efficient as it gets when it comes to removing memory from a mapping.

Thank you,
Andrea