On Mon, 20 May 2019, Mel Gorman wrote:

> > There was exhausting discussion subsequent to this that caused Linus to
> > have to revert the offending commit late in an rc series that is not
> > described here.
>
> Yes, at the crux of that matter was which regression introduced was more
> important -- the one causing swap storms which Andrea is trying to address
> or a latency issue due to assumptions of locality when MADV_HUGEPAGE
> is used.
>
> More people are affected by swap storms and distributions are carrying
> out-of-tree patches to address it. Furthermore, multiple people unrelated
> to each other can trivially reproduce the problem with test cases and
> experience the problem with real workloads. Only you have a realistic
> workload sensitive to the latency issue and we've asked repeatedly for
> a test case (most recently Michal Hocko on May 4th) which is still not
> available.

Hi Mel,

Any workload that does MADV_HUGEPAGE will be impacted if remote hugepage
access latency is greater than local native page access latency and it is
using the long-standing behavior of the past three years.

The test case would be rather straightforward: induce node-local
fragmentation (easiest to do by injecting a kernel module), do
MADV_HUGEPAGE over a large range, fault, and measure random access
latency. This is readily observable and can be done synthetically to
measure the random access latency of local native pages vs remote
hugepages. Andrea provided this testcase in the original thread.
My results from right now:

# numactl -m 0 -C 0 ./numa-thp-bench
random writes MADV_HUGEPAGE 17492771 usec
random writes MADV_NOHUGEPAGE 21344846 usec
random writes MADV_NOHUGEPAGE 21399545 usec
random writes MADV_HUGEPAGE 17481949 usec
# numactl -m 0 -C 64 ./numa-thp-bench
random writes MADV_HUGEPAGE 26858061 usec
random writes MADV_NOHUGEPAGE 31067825 usec
random writes MADV_NOHUGEPAGE 31334770 usec
random writes MADV_HUGEPAGE 26785942 usec

That's 25.8% greater random access latency when going across socket vs
accessing local native pages.

> > This isn't an argument in support of this patch, there is a difference
> > between (1) pages of the native page size being faulted first locally
> > falling back remotely and (2) hugepages being faulted first locally and
> > falling back to native pages locally because it has better access
> > latency on most platforms for workloads that do not span multiple
> > nodes. Note that the page allocator is unaware whether the workload
> > spans multiple nodes so it cannot make this distinction today, and
> > that's what I'd prefer to focus on rather than changing an overall
> > policy for everybody.
>
> Overall, I think it would be ok to have behaviour whereby local THP is
> allocated if cheaply, followed by base pages local followed by the remote
> options. However, __GFP_THISNODE removes the possibility of allowing
> remote fallback and instead causing a swap storm and swap storms are
> trivial to generate on NUMA machine running a mainline kernel today.

Yes, this is hopefully what we can focus on and I hope we can make forward
progress with (1) extending mempolicies to allow specifying hugepage
specific policies, (2) the prctl(), (3) improving the feedback loop
between compaction and direct reclaim, and/or (4) resolving the overloaded
and conflicting meanings of
/sys/kernel/mm/transparent_hugepage/{enabled,defrag} and
MADV_HUGEPAGE/MADV_NOHUGEPAGE.
The issue here is the overloaded nature of what MADV_HUGEPAGE means and
what the system-wide thp settings mean. They cannot possibly provide sane
behavior for all possible workloads given only two settings.

MADV_HUGEPAGE itself has *four* meanings: (1) determine hugepage
eligibility when not default, (2) try to do synchronous
compaction/reclaim at fault, (3) determine eligibility for khugepaged,
and (4) control defrag settings based on the system-wide setting. The
patch here is adding a fifth: (5) prefer remote allocation when local
memory is fragmented. None of this is sustainable.

Note that this patch is also preferring remote hugepage allocation *over*
local hugepages before trying memory compaction locally, depending on the
setting of vm.zone_reclaim_mode, so it is infringing on the long-standing
behavior of (2) as well.

In situations such as these, it is not surprising that there are issues
reported with any combination of flags or settings and that patches get
proposed that are very workload dependent. My suggestion has been to move
in a direction where this can be resolved such that userspace has a clean
and stable API and we can allow remote hugepage allocation for workloads
that specifically opt in, but not incur 25.8% greater access latency for
using the behavior of the past 3+ years.

Another point that has consistently been raised on LKML is the inability
to disable MADV_HUGEPAGE once set: i.e. if you set it for faulting your
workload, you are required to do MADV_NOHUGEPAGE to clear it, and then
you are explicitly asking that this memory not be backed by hugepages.

> > It may not have been meant to provide this, but when IBM changed this
> > three years ago because of performance regressions and others have
> > started to use MADV_HUGEPAGE with that policy in mind, it is the
> > reality of what the madvise advice has provided.
> > What was meant to be semantics of MADV_HUGEPAGE three years ago is
> > irrelevant today if it introduces performance regressions for users
> > who have used the advice mode during that past three years.
>
> Incurring swap storms when there is plenty of free memory available is
> terrible. If the working set size is larger than a node, the swap storm
> may even persist indefinitely.

Let's fix it.

> > Yes, this is clearly understood and was never objected to when this
> > first came up in the thread where __GFP_THISNODE was removed or when
> > Linus reverted the patch.
> >
> > The issue being discussed here is a result of MADV_HUGEPAGE being
> > overloaded: it cannot mean to control (1) how much compaction/reclaim
> > is done for page allocation, (2) the NUMA locality of those hugepages,
> > and (3) the eligibility of the memory to be collapsed into hugepages
> > by khugepaged all at the same time.
> >
> > I suggested then that we actually define (2) concretely specifically
> > for the usecase that you mention. Changing the behavior of
> > MADV_HUGEPAGE for the past three years, however, and introducing
> > performance regressions for those users is not an option regardless of
> > the intent that it had when developed.
>
> The current behaviour is potential swap storms that are difficult to
> avoid because the behaviour is hard-wired into the kernel internals.
> Only disabling THP, either on a task or global basis, avoids it when
> encountered.

We are going in circles. *Yes*, there is a problem of potential swap
storms today because of the poor interaction between memory compaction
and directed reclaim, but this is a result of a poor API that does not
allow userspace to specify that its workload really will span multiple
sockets so faulting remotely is the best course of action.
The fix is not to cause regressions for others who have implemented a
userspace stack based on the past 3+ years of long-standing behavior, nor
for specialized workloads that are known to span multiple sockets and so
want some kind of different behavior. We need to provide a clear and
stable API to define these terms for the page allocator that is
independent of any global setting of thp enabled, defrag,
zone_reclaim_mode, etc. It's workload dependent.

Is it deserving of an MADV_REMOTE_HUGEPAGE, a prctl(), hugepage extended
mempolicies, or something else? I'm positive that the problem is
understood and that we could reach some form of consensus before an
implementation is invested in, if we actually discuss ways to fix the
underlying issue.

Thanks.