On 6/30/20 5:47 PM, David Rientjes wrote:
> On Mon, 29 Jun 2020, Dave Hansen wrote:
>> From: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
>>
>> If a memory node has a preferred migration path to demote cold pages,
>> attempt to move those inactive pages to that migration node before
>> reclaiming. This will better utilize available memory, provide a faster
>> tier than swapping or discarding, and allow such pages to be reused
>> immediately without IO to retrieve the data.
>>
>> When handling anonymous pages, this will be considered before swap if
>> enabled. Should the demotion fail for any reason, the page reclaim
>> will proceed as if the demotion feature was not enabled.
>>
>
> Thanks for sharing these patches and kick-starting the conversation, Dave.
>
> Could this cause us to break a user's mbind() or allow a user to
> circumvent their cpuset.mems?

In its current form, yes.

My current rationale for this is that while it's not as deferential as
it can be to the user/kernel ABI contract, it's good *overall* behavior.
The auto-migration only kicks in when the data is about to go away.  So
while the user's data might be slower than they like, it is *WAY* faster
than they deserve because it should be off on the disk.

> Because we don't have a mapping of the page back to its allocation
> context (or the process context in which it was allocated), it seems like
> both are possible.
>
> So let's assume that migration nodes cannot be other DRAM nodes.
> Otherwise, memory pressure could be intentionally or unintentionally
> induced to migrate these pages to another node.  Do we have such a
> restriction on migration nodes?

There's nothing explicit.  On a normal, balanced system where there's a
1:1:1 relationship between CPU sockets, DRAM nodes and PMEM nodes, it's
implicit since the migration path is one deep and goes from DRAM->PMEM.

If there were some oddball system where there was a memory-only DRAM
node, it might very well end up being a migration target.
>> Some places we would like to see this used:
>>
>> 1. Persistent memory being used as a slower, cheaper DRAM replacement
>> 2. Remote memory-only "expansion" NUMA nodes
>> 3. Resolving memory imbalances where one NUMA node is seeing more
>>    allocation activity than another.  This helps keep more recent
>>    allocations closer to the CPUs on the node doing the allocating.
>
> (3) is the concerning one given the above if we are to use
> migrate_demote_mapping() for DRAM node balancing.

Yeah, agreed.  That's the sketchiest of the three. :)

>> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
>> +{
>> +	/*
>> +	 * 'mask' targets allocation only to the desired node in the
>> +	 * migration path, and fails fast if the allocation can not be
>> +	 * immediately satisfied.  Reclaim is already active and heroic
>> +	 * allocation efforts are unwanted.
>> +	 */
>> +	gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
>> +			__GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
>> +			__GFP_MOVABLE;
>
> GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we
> actually want to kick kswapd on the pmem node?

In my mental model, cold data flows from:

	DRAM -> PMEM -> swap

Kicking kswapd here ensures that while we're doing DRAM->PMEM migrations
for kinda cold data, kswapd can be working on doing the PMEM->swap part
on really cold data.

...

>> @@ -1229,6 +1230,30 @@ static unsigned long shrink_page_list(st
>>  			; /* try to reclaim the page below */
>>  		}
>>
>> +		rc = migrate_demote_mapping(page);
>> +		/*
>> +		 * -ENOMEM on a THP may indicate either migration is
>> +		 * unsupported or there was not enough contiguous
>> +		 * space.  Split the THP into base pages and retry the
>> +		 * head immediately.  The tail pages will be considered
>> +		 * individually within the current loop's page list.
>> +		 */
>> +		if (rc == -ENOMEM && PageTransHuge(page) &&
>> +		    !split_huge_page_to_list(page, page_list))
>> +			rc = migrate_demote_mapping(page);
>> +
>> +		if (rc == MIGRATEPAGE_SUCCESS) {
>> +			unlock_page(page);
>> +			if (likely(put_page_testzero(page)))
>> +				goto free_it;
>> +			/*
>> +			 * Speculative reference will free this page,
>> +			 * so leave it off the LRU.
>> +			 */
>> +			nr_reclaimed++;
>
> nr_reclaimed += nr_pages instead?

Oh, good catch.  I also need to go double-check that 'nr_pages' isn't
wrong elsewhere because of the split.