On Sun, 8 Sep 2019, Vlastimil Babka wrote:

> > On Sat, 7 Sep 2019, Linus Torvalds wrote:
> >
> >>> Andrea acknowledges the swap storm that he reported would be fixed with
> >>> the last two patches in this series
> >>
> >> The problem is that even you aren't arguing that those patches should
> >> go into 5.3.
> >>
> >
> > For three reasons: (a) we lack a test result from Andrea,
>
> That's argument against the rfc patches 3+4s, no? But not for including
> the reverts of reverts of reverts (patches 1+2).
>

Yes, thanks: I would strongly prefer not to propose rfc patches 3-4 without
a testing result from Andrea and collaboration to fix the underlying issue.
My suggestion to Linus is to merge patches 1-2 so we don't have additional
semantics for MADV_HUGEPAGE or thp enabled=always configs based on kernel
version, especially since they are already conflated.

> > (b) there's
> > on-going discussion, particularly based on Vlastimil's feedback, and
>
> I doubt this will be finished and tested with reasonable confidence even
> for the 5.4 merge window.
>

Depends, but I probably suspect the same. If the reverts to 5.3 are not
applied, then I'm not at all confident that forward progress on this issue
will be made: my suggestion about changes to the page allocator when the
patches were initially proposed went unresponded to, as did the ping on
those suggestions, and now we have a simplistic "this will fix the swap
storms" but no active involvement from Andrea to improve this; he likely
is quite content on lumping NUMA policy onto an already overloaded madvise
mode.

[ NOTE! The rest of this email and my responses are about how to address
the default page allocation behavior which we can continue to discuss but
I'd prefer it separated from the discussion of reverts for 5.3 which needs
to be done to not conflate madvise modes with mempolicies for a subset of
kernel versions.
]

> > It indicates that progress has been made to address the actual bug without
> > introducing long-lived access latency regressions for others, particularly
> > those who use MADV_HUGEPAGE. In the worst case, some systems running
> > 5.3-rc4 and 5.3-rc5 have the same amount of memory backed by hugepages but
> > on 5.3-rc5 the vast majority of it is allocated remotely. This incurs a
>
> It's been said before, but such sensitive code generally relies on
> mempolicies or node reclaim mode, not THP __GFP_THISNODE implementation
> details. Or if you know there's enough free memory and just needs to be
> compacted, you could do it once via sysfs before starting up your workload.
>

This entire discussion is based on the long standing and default behavior
of page allocation for transparent hugepages. Your suggestions are not
possible for two reasons: (1) I cannot enforce a mempolicy of MPOL_BIND
because this doesn't allow fallback at all and would oom kill if the local
node is oom, and (2) node reclaim mode is a system-wide setting so all
workloads are affected for every page allocation, not only users of
MADV_HUGEPAGE who specifically opt-in to expensive allocation.

We could make the argument that Andrea's qemu usecase could simply use
MPOL_PREFERRED for memory that should be faulted remotely which would
provide more control and would work for all versions of Linux regardless
of MADV_HUGEPAGE or not; that's a much simpler workaround than conflating
MADV_HUGEPAGE for NUMA locality, asking users who are adversely affected
by 5.3 to create new mempolicies to work around something that has always
worked fine, or asking users to tune page allocator policies with sysctls.

> > I'm arguing to revert 5.3 back to the behavior that we have had for years
> > and actually fix the bug that everybody else seems to be ignoring and then
> > *backport* those fixes to 5.3 stable and every other stable tree that can
> > use them.
> > Introducing a new mempolicy for NUMA locality into 5.3.0 that
> >
> I think it's rather removing the problematic implicit mempolicy of
> __GFP_THISNODE.
>

I'm referring to a solution that is backwards compatible for existing
users which 5.3 is certainly not.

> I might have missed something, but you were asked for a reproducer of
> your use case so others can develop patches with it in mind? Mel did
> provide a simple example that shows the swap storms very easily.
>

Are you asking for a synthetic kernel module that you can inject to induce
fragmentation on a local node where memory compaction would be possible
and then a userspace program that uses MADV_HUGEPAGE and fits within that
node? The regression I'm reporting is for workloads that fit within a
socket; it requires local fragmentation to show a regression.

For the qemu case, it's quite easy to fill a local node and require
additional hugepage allocations with MADV_HUGEPAGE in a test case, but
without synthetically inducing fragmentation I cannot provide a testcase
that will show performance regression because memory is quickly faulted
remotely rather than compacting locally.
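
For reference, the userspace half of the test case described above (a
program that opts in with MADV_HUGEPAGE and faults its memory) could be
sketched roughly as follows. This is only an illustration and not code
from this thread: the helper names (hpage_round_up, thp_region_alloc,
thp_region_free) are hypothetical, the 2 MiB hugepage size is an
assumption that holds on x86-64 but not everywhere, and the sketch does
nothing to induce fragmentation or to control which node backs the memory.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14	/* Linux value; fallback for older libc headers */
#endif

/* Assumption: 2 MiB transparent hugepages (x86-64 default). */
#define HPAGE_SIZE ((size_t)2 << 20)

/* Hypothetical helper: round len up to a multiple of HPAGE_SIZE. */
static size_t hpage_round_up(size_t len)
{
	return (len + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
}

struct thp_region {
	void *base;		/* start of the raw mapping */
	size_t map_len;		/* length of the raw mapping */
	void *aligned;		/* hugepage-aligned start inside the mapping */
	size_t len;		/* usable, hugepage-multiple length */
	int madvise_ok;		/* 1 if the kernel accepted MADV_HUGEPAGE */
};

/*
 * Map an anonymous region, align it to a hugepage boundary, opt in to
 * THP with madvise(MADV_HUGEPAGE), and touch it so it is faulted in.
 * Returns 0 on success, -1 if mmap() fails.  A failing madvise() (for
 * example on a kernel built without THP) is recorded, not treated as
 * fatal.
 */
static int thp_region_alloc(struct thp_region *r, size_t len)
{
	r->len = hpage_round_up(len);
	r->map_len = r->len + HPAGE_SIZE;	/* slack for alignment */
	r->base = mmap(NULL, r->map_len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (r->base == MAP_FAILED)
		return -1;

	r->aligned = (void *)(((uintptr_t)r->base + HPAGE_SIZE - 1) &
			      ~(uintptr_t)(HPAGE_SIZE - 1));
	assert(((uintptr_t)r->aligned & (HPAGE_SIZE - 1)) == 0);

	r->madvise_ok = (madvise(r->aligned, r->len, MADV_HUGEPAGE) == 0);

	/* First touch: this is where hugepage allocation is attempted. */
	memset(r->aligned, 1, r->len);
	return 0;
}

static void thp_region_free(struct thp_region *r)
{
	munmap(r->base, r->map_len);
}
```

Whether the faulted memory ends up local or remote, and whether it is
backed by hugepages at all, depends on the kernel version and the state
of the node, which is the subject of the discussion above; the sketch
only shows the opt-in and first-touch steps.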