Hello everyone,

On Thu, May 23, 2019 at 05:57:37PM -0700, Andrew Morton wrote:
> On Mon, 20 May 2019 10:54:16 -0700 (PDT) David Rientjes <rientjes@xxxxxxxxxx> wrote:
>
> > We are going in circles, *yes* there is a problem for potential swap
> > storms today because of the poor interaction between memory compaction and
> > directed reclaim but this is a result of a poor API that does not allow
> > userspace to specify that its workload really will span multiple sockets
> > so faulting remotely is the best course of action. The fix is not to
> > cause regressions for others who have implemented a userspace stack that
> > is based on the past 3+ years of long standing behavior or for specialized
> > workloads where it is known that it spans multiple sockets so we want some
> > kind of different behavior. We need to provide a clear and stable API to
> > define these terms for the page allocator that is independent of any
> > global setting of thp enabled, defrag, zone_reclaim_mode, etc. It's
> > workload dependent.
>
> um, who is going to do this work?

That's a good question. It's not going to be a simple patch to
backport to -stable: it'll be intrusive and it will affect
mm/page_alloc.c significantly, so it'll reject heavily. I wouldn't
consider it -stable material, at least in the short term; it will
require some testing.

This is why applying a simple fix that avoids the swap storms (and the
swap-less pathological THP regression for vfio device assignment GUP
pinning) is preferable before adding an alloc_pages_multi_order (or
equivalent), so that it'll be the allocator that decides when exactly
to fall back from 2M to 4k depending on the NUMA distance and memory
availability during the zonelist walk. The basic idea is to call
alloc_pages just once (not first for 2M and then for 4k) and
alloc_pages will decide which page "order" to return.

> Implementing a new API doesn't help existing userspace which is hurting
> from the problem which this patch addresses.

Yes, we can't change all apps that may not fit in a single NUMA node.
Currently it's unsafe to set "transparent_hugepage/defrag = always",
or the bad behavior can then materialize also outside of
MADV_HUGEPAGE. Those apps that use MADV_HUGEPAGE on their long lived
allocations (i.e. guest physical memory) like qemu are affected even
with the default "defrag = madvise". Those apps have been using
MADV_HUGEPAGE for more than 3 years and they are widely used and open
source of course.

> It does appear to me that this patch does more good than harm for the
> totality of kernel users, so I'm inclined to push it through and to try
> to talk Linus out of reverting it again.

That sounds great. It's also what 3 enterprise distributions had to do
already.

As Mel described in detail, remote THP can't be slower than the swap
I/O (even if we'd swap on an nvdimm it wouldn't change this).

As Michael suggested, a dynamic "numa_node_id()" mbind could be
pursued orthogonally, to still be able to retain the current upstream
behavior for small apps that can fit in the node, do extremely long
lived static allocations and don't care if they cause a swap storm
during startup. All we argue about is the default "defrag = always"
and MADV_HUGEPAGE behavior.

The current behavior of "defrag = always" and MADV_HUGEPAGE is in fact
way more aggressive than zone_reclaim_mode, which is also not enabled
by default for similar reasons (but enabling zone_reclaim_mode by
default would cause much less risk of pathological regressions to
large workloads that can't fit in a single node). Enabling
zone_reclaim_mode would eventually fall back to remote nodes
gracefully.
By contrast, the fallback to remote nodes with __GFP_THISNODE can only
happen after the 2M allocation has failed, and the problem is that 2M
allocations don't fail, because compaction+reclaim interleaving keeps
succeeding by swapping out more and more memory. That would be the
perfectly right behavior for compaction+reclaim interleaving if only
the whole system were out of memory in all nodes (and it isn't).

The false positive result from the automated testing (where overall
swapping performance decreased because fairness increased) wasn't
anybody's fault, and so the revert at the end of the merge window was
a safe approach. So we can try again to fix it now.

Thanks!
Andrea
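
For illustration only, here is a rough userspace toy model of the
alloc_pages_multi_order idea mentioned earlier in this mail: one call,
one zonelist walk, and the allocator picks the order to return. None
of this is kernel code. The toy_* names, the max_thp_distance
parameter and the policy itself (take a 2M block from any node within
the distance limit, otherwise take a 4k page from the nearest node
that has one, never reclaim) are assumptions made up for this sketch,
not a proposed implementation.

```c
/* Userspace toy model, NOT kernel code: every name here is hypothetical. */
#include <assert.h>
#include <stddef.h>

#define ORDER_4K 0	/* a single 4k page */
#define ORDER_2M 9	/* 512 contiguous 4k pages (x86-64 THP size) */

struct toy_node {
	int id;
	int distance;		/* NUMA distance from the allocating CPU */
	int free_2m_blocks;	/* free 2M blocks, already contiguous */
	long free_4k_pages;	/* free 4k pages outside those blocks */
};

struct toy_alloc {
	int node;		/* -1 if nothing could be allocated */
	int order;		/* ORDER_2M or ORDER_4K */
};

/*
 * Walk the zonelist (nodes sorted by increasing distance) exactly once.
 * Assumed policy: return a 2M block from the first node that has one and
 * is no further away than max_thp_distance; otherwise return a 4k page
 * from the nearest node that has one. No compaction or reclaim is modeled.
 */
static struct toy_alloc
toy_alloc_pages_multi_order(struct toy_node *zonelist, size_t nr_nodes,
			    int max_thp_distance)
{
	struct toy_alloc ret = { -1, ORDER_4K };
	struct toy_node *fallback = NULL;
	size_t i;

	assert(zonelist != NULL);

	for (i = 0; i < nr_nodes; i++) {
		struct toy_node *n = &zonelist[i];

		if (n->distance <= max_thp_distance && n->free_2m_blocks > 0) {
			n->free_2m_blocks--;
			ret.node = n->id;
			ret.order = ORDER_2M;
			return ret;
		}
		/* Remember the nearest node that can still give us 4k. */
		if (!fallback && n->free_4k_pages > 0)
			fallback = n;
	}

	if (fallback) {
		fallback->free_4k_pages--;
		ret.node = fallback->id;
	}
	return ret;
}
```

With a local node at distance 10 that has no free 2M block and a
remote node at distance 21 that has some, a limit of 21 in this model
returns a remote 2M page, while a limit of 10 returns a local 4k page.
Where that limit should sit is exactly the workload dependent part.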