With Dave Hansen's patches merged into Linus's tree

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4

PMEM can be hot plugged as a NUMA node now. But how to use PMEM as a NUMA node effectively and efficiently is still worth exploring. There have been a couple of proposals posted on the mailing list [1] [2] [3].

I had already posted two versions of a patchset for demoting/promoting memory pages between DRAM and PMEM before this topic was discussed at LSF/MM 2019 (https://lwn.net/Articles/787418/). I do appreciate all the great suggestions from the community. This updated version implements most of what was discussed; please see the design section below for the details.

Changelog
=========
v2 --> v3:
  * Introduced "migrate mode" for node reclaim. Only do demotion when
    "migrate mode" is specified, per Michal Hocko and Mel Gorman.
  * Introduced the "migrate target" concept for the VM, per Mel Gorman.
    Memory nodes which sit below DRAM in the hierarchy (i.e. lower
    bandwidth, higher latency, larger capacity and cheaper than DRAM)
    are considered "migrate target" nodes. When "migrate mode" is on,
    memory reclaim demotes pages to the "migrate target" nodes.
  * Dropped the "twice access" promotion patch, per Michal Hocko.
  * Changed the subject of the patchset to reflect the update.
  * Rebased to 5.2-rc1.

v1 --> v2:
  * Dropped the default allocation node mask. The memory placement
    restriction can be achieved by mempolicy or cpuset instead.
  * Dropped the new mempolicy since its semantics are not clear yet.
  * Dropped the PG_Promote flag.
  * Defined the N_CPU_MEM nodemask for nodes which have both CPU and
    memory.
  * Extended page_check_references() to implement the "twice access"
    check for anonymous pages in the NUMA balancing path.
  * Reworked the memory demotion code.
v2: https://lore.kernel.org/linux-mm/1554955019-29472-1-git-send-email-yang.shi@xxxxxxxxxxxxxxxxx/
v1: https://lore.kernel.org/linux-mm/1553316275-21985-1-git-send-email-yang.shi@xxxxxxxxxxxxxxxxx/

Design
======
With the development of new memory technology, we can have cheaper and larger memory devices on the system which have higher latency and lower bandwidth than DRAM, i.e. PMEM. Such a device can be used as persistent storage or as volatile memory, and it fits into the memory hierarchy as a second-tier memory.

This patchset explores an approach to utilizing such memory to improve memory placement. Basically, it tries to achieve this goal by doing memory promotion/demotion via NUMA balancing and memory reclaim.

It introduces a new "migrate" mode for node reclaim. When DRAM is under memory pressure, pages are demoted to PMEM via the node reclaim path if "migrate" mode is on. NUMA balancing will then promote a page back to DRAM as soon as the page is referenced again. Memory pressure on the PMEM node pushes its inactive pages out to disk via swap.

It also introduces the "primary" node and "migrate target" node concepts for the VM (patches 1/9 and 2/9). A "primary" node is a node which has both CPU and memory. A "migrate target" node is a cpuless node which sits below DRAM in the memory hierarchy (PMEM may be a suitable one: lower bandwidth, higher latency, larger capacity and cheaper than DRAM). The firmware is effectively going to enforce "cpu-less" nodes for any memory range that has differentiated performance from the conventional memory pool, or differentiated performance for a specific initiator.

The "N_CPU_MEM" nodemask is defined for the "primary" nodes in order to distinguish them from cpuless nodes (memory only, i.e. PMEM nodes) and memoryless nodes (some architectures, e.g. Power, may have memoryless nodes).

It is a bit hard to find a suitable "migrate target" node, since this needs the firmware to expose the physical characteristics of the memory devices.
I'm not quite sure what the best way is, nor whether it is ready to use now. Since PMEM is the only such device available for now, retrieving the information from the SRAT sounds like the easiest way. We may figure out a better way in the future.

Promotion/demotion happens only between "primary" nodes and "migrate target" nodes: demotion goes from "primary" nodes to "migrate target" nodes, and promotion goes from "migrate target" nodes to "primary" nodes. There is no promotion/demotion between "migrate target" nodes. This guarantees there are no cycles in memory demotion or promotion.

According to the discussion at LSF/MM 2019, "there should only be one node to which pages could be migrated". So the reclaim code just tries to demote pages to the closest "migrate target" node, and only tries once. Otherwise, "if all nodes in the system were on a fallback list, a page would have to move through every possible option - each RAM-based node and each persistent-memory node - before actually being reclaimed. It would be necessary to maintain the history of where each page has been, and would be likely to disrupt other workloads on the system". This is what the v2 patchset did, so v3 keeps doing it the same way.

The demotion code moves all migration candidate pages onto one single list, then migrates them together (including THP). This improves the efficiency of migration according to Zi Yan's research. If migration fails, the unmigrated pages are put back onto the LRU. The most optimistic GFP flags are used to allocate pages on the "migrate target" node.

To reduce the failure rate of demotion, check whether the "migrate target" node is contended. If it is, just swap instead of migrating. If migration fails with -ENOMEM, mark the node contended. The contended flag is cleared once the node gets balanced again.
For now, "migrate" mode is not compatible with cpuset and mempolicy, since it is hard to get at the process's task_struct from a struct page: the cpuset and the process's mempolicy are stored in task_struct rather than mm_struct.

Only anonymous pages are handled for the time being, since NUMA balancing can't promote unmapped page cache. Page cache can be demoted easily, but promotion is an open question; it might be done via mark_page_accessed().

Added vmstat counters for pgdemote_kswapd, pgdemote_direct and numa_pages_promoted.

There are definitely still a lot of details that need to be sorted out. Any comment is welcome.

Test
====
The stress test was done with mmtests plus application workloads (i.e. sysbench, grep, etc). Memory pressure was generated by running mmtests' usemem-stress-numa-compact, then other applications were run as workload to stress the promotion and demotion paths. The machine was still alive after the stress test had been running for ~30 hours. /proc/vmstat also shows:

...
pgdemote_kswapd 3316563
pgdemote_direct 1930721
...
numa_pages_promoted 81838

[1]: https://lore.kernel.org/linux-mm/20181226131446.330864849@xxxxxxxxx/
[2]: https://lore.kernel.org/linux-mm/20190321200157.29678-1-keith.busch@xxxxxxxxx/
[3]: https://lore.kernel.org/linux-mm/20190404071312.GD12864@xxxxxxxxxxxxxx/T/#me1c1ed102741ba945c57071de9749e16a76e9f3d

Yang Shi (9):
      mm: define N_CPU_MEM node states
      mm: Introduce migrate target nodemask
      mm: page_alloc: make find_next_best_node return migration target node
      mm: migrate: make migrate_pages() return nr_succeeded
      mm: vmscan: demote anon DRAM pages to migration target node
      mm: vmscan: don't demote for memcg reclaim
      mm: vmscan: check if the demote target node is contended or not
      mm: vmscan: add page demotion counter
      mm: numa: add page promotion counter

 Documentation/sysctl/vm.txt    |   6 +++
 drivers/acpi/numa.c            |  12 +++++
 drivers/base/node.c            |   4 ++
 include/linux/gfp.h            |  12 +++++
 include/linux/migrate.h        |   6 ++-
 include/linux/mmzone.h         |   3 ++
 include/linux/nodemask.h       |   4 +-
 include/linux/vm_event_item.h  |   3 ++
 include/linux/vmstat.h         |   1 +
 include/trace/events/migrate.h |   3 +-
 mm/compaction.c                |   3 +-
 mm/debug.c                     |   1 +
 mm/gup.c                       |   4 +-
 mm/huge_memory.c               |   4 ++
 mm/internal.h                  |  23 ++++++++
 mm/memory-failure.c            |   7 ++-
 mm/memory.c                    |   4 ++
 mm/memory_hotplug.c            |  10 +++-
 mm/mempolicy.c                 |   7 ++-
 mm/migrate.c                   |  33 ++++++++----
 mm/page_alloc.c                |  20 +++++--
 mm/vmscan.c                    | 186 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 mm/vmstat.c                    |  14 ++++-
 23 files changed, 323 insertions(+), 47 deletions(-)