On Fri, 24 May 2019, Andrea Arcangeli wrote:

> > > We are going in circles, *yes* there is a problem for potential swap
> > > storms today because of the poor interaction between memory compaction and
> > > directed reclaim but this is a result of a poor API that does not allow
> > > userspace to specify that its workload really will span multiple sockets
> > > so faulting remotely is the best course of action. The fix is not to
> > > cause regressions for others who have implemented a userspace stack that
> > > is based on the past 3+ years of long standing behavior or for specialized
> > > workloads where it is known that it spans multiple sockets so we want some
> > > kind of different behavior. We need to provide a clear and stable API to
> > > define these terms for the page allocator that is independent of any
> > > global setting of thp enabled, defrag, zone_reclaim_mode, etc. It's
> > > workload dependent.
> > 
> > um, who is going to do this work?
> 
> That's a good question. It's going to be a not simple patch to
> backport to -stable: it'll be intrusive and it will affect
> mm/page_alloc.c significantly so it'll reject heavy. I wouldn't
> consider it -stable material at least in the short term, it will
> require some testing.
> 

Hi Andrea,

I'm not sure what patch you're referring to, unfortunately. The above
comment was referring to APIs that are made available to userspace to
define when to fault locally vs remotely and what the preference should
be for any form of compaction or reclaim to achieve that. Today we have
global enabling options, global defrag settings, enabling prctls, and
madvise options.

The point it makes is that whether a specific workload fits into a
single socket is workload dependent, and thus we are left with prctls
and madvise options. The prctl either enables thp or it doesn't, so it
is not interesting here; the madvise is overloaded in four different
ways (enabling, stalling at fault, collapsibility, defrag), so it's not
surprising that continuing to overload it for existing users will cause
undesired results. It makes an argument that we need a clear and stable
means of defining the behavior, not changing the 4+ year behavior and
giving those who regress no workaround.

> This is why applying a simple fix that avoids the swap storms (and the
> swap-less pathological THP regression for vfio device assignment GUP
> pinning) is preferable before adding an alloc_pages_multi_order (or
> equivalent) so that it'll be the allocator that will decide when
> exactly to fallback from 2M to 4k depending on the NUMA distance and
> memory availability during the zonelist walk. The basic idea is to
> call alloc_pages just once (not first for 2M and then for 4k) and
> alloc_pages will decide which page "order" to return.
> 

The commit description doesn't mention the swap storms that you're
trying to fix; it's probably better to describe that again and why it is
not beneficial to swap unless an entire pageblock can become free or
memory compaction has indicated that additional memory freeing would
allow migration to make an entire pageblock free. I understand that's an
invasive code change, but merging this patch changes the 4+ year
behavior that started here:

commit 077fcf116c8c2bd7ee9487b645aa3b50368db7e1
Author: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxxxxxxx>
Date:   Wed Feb 11 15:27:12 2015 -0800

    mm/thp: allocate transparent hugepages on local node

And that commit's description describes quite well the regression that
we encounter if we remove __GFP_THISNODE here.
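To make the userspace expectation concrete, here is a rough, hypothetical
sketch of the pattern qemu-style users of MADV_HUGEPAGE rely on: madvise
a long-lived anonymous mapping, fault it in, and under the behavior
introduced by that commit the hugepages are allocated from the local
node. The size and the smaps check below are illustrative only, not
qemu's actual code.

/*
 * Hypothetical sketch, not qemu's actual code: illustrative size and a
 * crude smaps check.  Build with: gcc -O2 thp-local.c -o thp-local
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define GUEST_MEM_SIZE	(1UL << 30)	/* 1GB of "guest memory" */

int main(void)
{
	FILE *f;
	char line[256];
	char *mem = mmap(NULL, GUEST_MEM_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (mem == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Long-lived allocation: ask for hugepages on this range. */
	if (madvise(mem, GUEST_MEM_SIZE, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");

	/* Fault everything in now, the way a guest touching its memory would. */
	memset(mem, 0, GUEST_MEM_SIZE);

	/* Crude check of how much of the address space ended up as THP. */
	f = fopen("/proc/self/smaps", "r");
	while (f && fgets(line, sizeof(line), f))
		if (strstr(line, "AnonHugePages:"))
			fputs(line, stdout);
	if (f)
		fclose(f);

	munmap(mem, GUEST_MEM_SIZE);
	return 0;
}

The disagreement is about what should happen at that memset() when the
local node has no free or readily compactable 2MB blocks: reclaim and
swap locally, fall back to a remote hugepage, or fall back to local 4k
pages.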
We encounter that regression because the access latency penalty is much
more substantial than what was reported for Naples in your changelog.

In the interest of making forward progress, can we agree that swapping
from the local node *never* makes sense unless we can show that an
entire pageblock can become free or that it enables memory compaction to
migrate memory that can make an entire pageblock free? Are you reporting
swap storms for the local node when one of these is true?

> > Implementing a new API doesn't help existing userspace which is hurting
> > from the problem which this patch addresses.
> 
> Yes, we can't change all apps that may not fit in a single NUMA
> node. Currently it's unsafe to turn "transparent_hugepages/defrag =
> always" or the bad behavior can then materialize also outside of
> MADV_HUGEPAGE. Those apps that use MADV_HUGEPAGE on their long lived
> allocations (i.e. guest physical memory) like qemu are affected even
> with the default "defrag = madvise". Those apps are using
> MADV_HUGEPAGE for more than 3 years and they are widely used and open
> source of course.
> 

I continue to reiterate that the 4+ year long-standing behavior of
MADV_HUGEPAGE is overloaded; you are anticipating a specific behavior
for workloads that do not fit in a single NUMA node, whereas other users
developed over the past four years are anticipating a different
behavior.

I'm trying to propose solutions that cannot cause regressions for any
user, such as the prctl() example that is inherited across fork and can
be used to define the behavior. This could be a very trivial extension
to prctl(PR_SET_THP_DISABLE) or it could be a more elaborate addition.
This would be set by any thread that forks qemu and can define that the
workload prefers remote hugepages because it spans more than one node
(see the sketch at the end of this mail).

Certainly we should agree that the majority of Linux workloads do not
span more than one socket. However, it *may* be possible to define this
as a global thp setting since most machines that run large guests are
only running large guests, so the default machine-level policy can
reflect that.
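To be concrete about the interface I mean: PR_SET_THP_DISABLE already
exists and is inherited across fork and preserved across exec, so
whoever launches qemu can set the policy once for the whole process
tree. Below is a minimal sketch of that existing knob; the "prefers
remote hugepages" setting itself is hypothetical and does not exist
today.

/*
 * Sketch of the existing, inherited prctl knob; the proposed "prefer
 * remote hugepages" extension is not shown because it does not exist.
 */
#include <stdio.h>
#include <sys/prctl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s <program> [args...]\n", argv[0]);
		return 1;
	}

	/*
	 * Disable THP for this process; the setting is inherited by
	 * children and preserved across exec, so the launched workload
	 * (e.g. qemu) runs with it as well.
	 */
	if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0)) {
		perror("prctl(PR_SET_THP_DISABLE)");
		return 1;
	}

	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}

An extension in the same style could carry the "this workload spans more
than one node" hint instead of a plain disable, without changing
MADV_HUGEPAGE semantics for existing users.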