On Sun, 2016-09-11 at 17:24 +0100, Mel Gorman wrote:
> On Thu, Sep 08, 2016 at 09:30:53PM -0400, Rik van Riel wrote:
> > Commit 4d9424669946 ("mm: convert p[te|md]_mknonnuma and remaining
> > page table manipulations") changed NUMA balancing from _PAGE_NUMA
> > to using PROT_NONE, and was quickly found to introduce a regression
> > with NUMA grouping.
> >
> > It was followed up by these changesets:
> >
> > 53da3bc2ba9e ("mm: fix up numa read-only thread grouping logic")
> > bea66fbd11af ("mm: numa: group related processes based on VMA flags
> > instead of page table flags")
> > b191f9b106ea ("mm: numa: preserve PTE write permissions across a
> > NUMA hinting fault")
> >
> > The first two of those changesets try alternate approaches to NUMA
> > grouping, which apparently do not work as well as looking at the
> > PTE write permissions.
> >
> > The last patch preserves the PTE write permissions across a NUMA
> > protection fault. However, it forgets to revert the condition for
> > whether or not to group tasks together back to what it was before
> > 3.19, even though the information is now preserved in the page
> > tables once again.
> >
> > This patch brings the NUMA grouping heuristic back to what it was
> > before changeset 4d9424669946, which the changelogs of subsequent
> > changesets suggest worked best.
> >
> > We have all the information again. We should probably use it.
> >
>
> Patch looks ok other than the comment above the second hunk being out
> of date. Out of curiosity, what workload benefitted from this? I saw a
> mix of marginal results when I ran this on a 2-socket and 4-socket box.

I did not performance test the change, because I believe the VM_WRITE
test that is currently upstream has a small logical error.

Specifically, VM_WRITE is also set for VMAs mapped with
PROT_WRITE|MAP_PRIVATE, which we do NOT want to group on.
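To make the distinction concrete, here is a minimal user-level model of the grouping tests in question: the upstream VM_WRITE test, the pte_write() test used in 3.19 and before, and a VM_WRITE|VM_SHARED variant. It is a sketch, not kernel code: struct fault_info, the group_by_*() helpers and the example cases are hypothetical, and VM_WRITE/VM_SHARED are local stand-ins named after the kernel's vm_flags bits. Here a true result means "allow grouping"; the kernel phrases its check roughly the other way round, flagging the fault TNF_NO_GROUP when the condition is false.

```c
#include <stdbool.h>

/*
 * Local stand-ins named after the kernel's vm_flags bits. The values
 * match include/linux/mm.h, but this model does not depend on them.
 */
#define VM_WRITE  0x00000002UL
#define VM_SHARED 0x00000008UL

/* Hypothetical summary of what a NUMA hinting fault handler can see. */
struct fault_info {
	unsigned long vm_flags; /* flags of the faulting VMA */
	bool pte_writable;      /* would pte_write() be true for the PTE? */
};

/* Upstream test: allow grouping whenever the VMA is writable. */
bool group_by_vm_write(const struct fault_info *f)
{
	return (f->vm_flags & VM_WRITE) != 0;
}

/* Variant: allow grouping only for writable shared VMAs. */
bool group_by_vm_write_shared(const struct fault_info *f)
{
	const unsigned long want = VM_WRITE | VM_SHARED;

	return (f->vm_flags & want) == want;
}

/* 3.19 and earlier: allow grouping only if the PTE itself is writable. */
bool group_by_pte_write(const struct fault_info *f)
{
	return f->pte_writable;
}

/*
 * Hypothetical case: a small rw--- library segment, i.e. a writable
 * MAP_PRIVATE VMA (no VM_SHARED) whose page has not been written yet,
 * so it still maps the shared page cache page through a read-only,
 * copy-on-write PTE.
 */
const struct fault_info libc_rw_before_cow = {
	.vm_flags = VM_WRITE, .pte_writable = false,
};

/* Hypothetical case: a MAP_SHARED, PROT_WRITE mapping, PTE writable. */
const struct fault_info shared_rw = {
	.vm_flags = VM_WRITE | VM_SHARED, .pte_writable = true,
};

/* Hypothetical case: an r-x-- text segment, nothing writable. */
const struct fault_info libc_text = {
	.vm_flags = 0, .pte_writable = false,
};
```

With these inputs, only group_by_vm_write() allows grouping for libc_rw_before_cow, which is the kind of false positive the VM_WRITE test can produce; all three tests agree on the other two cases.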
Every shared library mapped on my system seems to have a (small)
read-write VMA:

00007f5adacff000   1764K r-x-- libc-2.23.so
00007f5adaeb8000   2044K ----- libc-2.23.so
00007f5adb0b7000     16K r---- libc-2.23.so
00007f5adb0bb000      8K rw--- libc-2.23.so

In other words, the code that is currently upstream could result in
programs being grouped into a NUMA group due to accesses to libc.so,
if they happened to be started at about the same time. Until a process
writes to a page in that private read-write segment, it still maps the
same page cache page as every other process using libc.

This will not catch many programs, since most of them will have
private copies of the pages in the small read-write segments by the
time other programs start up, but it could catch a few of them.

Testing for VM_WRITE|VM_SHARED would solve that issue, but at that
point it would be essentially identical to reverting the code to the
old pte_write() test that we had in 3.19 and before.

I do not expect the performance impact to be visible, except when
somebody gets very unlucky with application startup timing.

--
All Rights Reversed.