On Wed, Oct 07, 2020 at 09:17:45AM -0700, Dave Hansen wrote:
> Signed-off-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
> Cc: Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx>
> Cc: David Rientjes <rientjes@xxxxxxxxxx>
> Cc: Huang Ying <ying.huang@xxxxxxxxx>
> Cc: Dan Williams <dan.j.williams@xxxxxxxxx>

I am still going through all the details, but here are my thoughts on the
things that caught my eye:

> --- a/include/linux/migrate.h~demote-with-migrate_pages	2020-10-07 09:15:31.028642442 -0700
> +++ b/include/linux/migrate.h	2020-10-07 09:15:31.034642442 -0700
> @@ -27,6 +27,7 @@ enum migrate_reason {
>  	MR_MEMPOLICY_MBIND,
>  	MR_NUMA_MISPLACED,
>  	MR_CONTIG_RANGE,
> +	MR_DEMOTION,
>  	MR_TYPES

I think you also need to add it under include/trace/events/migrate.h, so
the mm_migrate_pages event knows about it.

> +bool migrate_demote_page_ok(struct page *page, struct scan_control *sc)

Make it static? Also, scan_control seems to be unused here.

> +{
> +	int next_nid = next_demotion_node(page_to_nid(page));
> +
> +	VM_BUG_ON_PAGE(!PageLocked(page), page);

Right after the call to migrate_demote_page_ok, we call unlock_page, which
already has this check in place.
I know that this is only to be on the safe side and we do not lose
anything, but just my thoughts.

> +static struct page *alloc_demote_page(struct page *page, unsigned long node)
> +{
> +	/*
> +	 * Try to fail quickly if memory on the target node is not
> +	 * available. Leaving out __GFP_IO and __GFP_FS helps with
> +	 * this. If the desintation node is full, we want kswapd to
> +	 * run there so that its pages will get reclaimed and future
> +	 * migration attempts may succeed.
> +	 */
> +	gfp_t flags = (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_NORETRY |
> +		       __GFP_NOMEMALLOC | __GFP_NOWARN | __GFP_THISNODE |
> +		       __GFP_KSWAPD_RECLAIM);

I think it would be nicer to have this as a proper GFP_ define,
e.g. GFP_DEMOTION.

> +	/* HugeTLB pages should not be on the LRU */
> +	WARN_ON_ONCE(PageHuge(page));

I am not sure about this one.
This could only happen if the page, which is now on another list, ends up
in the buddy system. That is quite unlikely tbh.
And nevertheless, this is only a warning, which means that if this scenario
does happen, we will be allocating a single page to satisfy a higher-order
page, and I am not sure about the situation we will end up with.

> +
> +	if (PageTransHuge(page)) {
> +		struct page *thp;
> +
> +		flags |= __GFP_COMP;
> +
> +		thp = alloc_pages_node(node, flags, HPAGE_PMD_ORDER);
> +		if (!thp)
> +			return NULL;
> +		prep_transhuge_page(thp);
> +		return thp;
> +	}
> +
> +	return __alloc_pages_node(node, flags, 0);

Would it make sense to turn this into some sort of new_demotion_page, which
actually calls alloc_migration_target with the right stuff in place?
You would then pass a struct migration_target_control so that
alloc_migration_target does the right thing.
alloc_migration_target also takes care of calling prep_transhuge_page when
needed.
e.g:

static struct page *new_demotion_node(struct page *page, unsigned long private)
{
	struct migration_target_control mtc = {
		.nid = private,
		.gfp_mask = GFP_DEMOTION,
	};

	if (PageTransHuge(page))
		mtc.gfp_mask |= __GFP_COMP;

	return alloc_migration_target(page, (unsigned long)&mtc);
}

The only problem I see is that alloc_migration_target seems to "override"
the gfp_mask and ORs in GFP_TRANSHUGE for THP pages, which includes
__GFP_DIRECT_RECLAIM (not wanted in this case).
But maybe this can be worked around by checking if gfp_mask == GFP_DEMOTION,
and if so, just keeping the mask as it is.

> +
> +	if (list_empty(demote_pages))
> +		return 0;
> +
> +	/* Demotion ignores all cpuset and mempolicy settings */
> +	err = migrate_pages(demote_pages, alloc_demote_page, NULL,
> +			    target_nid, MIGRATE_ASYNC, MR_DEMOTION,
> +			    &nr_succeeded);

As I said, instead of alloc_demote_page, use a new_demote_page and make
alloc_migration_target handle the allocations and prep THP pages.

-- 
Oscar Salvador
SUSE L3
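
To illustrate the gfp_mask concern above, here is a small userspace model
of the suggested workaround. It is only a sketch under stated assumptions,
not kernel code: every MODEL_* name, the bit values, and model_target_gfp()
are hypothetical stand-ins, and the real alloc_migration_target may need a
different shape. One detail it tries to show is that an exact
"gfp_mask == GFP_DEMOTION" comparison would no longer match once the caller
has ORed in __GFP_COMP, so the model ignores that bit when comparing.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model only. Bit values are made up for illustration;
 * the real flags are defined in the kernel's gfp headers.
 */
typedef unsigned int gfp_t;

#define MODEL_GFP_MOVABLE		0x01u
#define MODEL_GFP_NORETRY		0x02u
#define MODEL_GFP_THISNODE		0x04u
#define MODEL_GFP_KSWAPD_RECLAIM	0x08u
#define MODEL_GFP_DIRECT_RECLAIM	0x10u
#define MODEL_GFP_COMP			0x20u

/* Hypothetical, reduced stand-in for the proposed GFP_DEMOTION. */
#define MODEL_GFP_DEMOTION	(MODEL_GFP_MOVABLE | MODEL_GFP_NORETRY | \
				 MODEL_GFP_THISNODE | MODEL_GFP_KSWAPD_RECLAIM)

/* Hypothetical, reduced stand-in for GFP_TRANSHUGE. */
#define MODEL_GFP_TRANSHUGE	(MODEL_GFP_MOVABLE | MODEL_GFP_COMP | \
				 MODEL_GFP_DIRECT_RECLAIM)

/* True if the mask is the demotion mask, with or without the COMP bit. */
static bool model_is_demotion_mask(gfp_t gfp_mask)
{
	return (gfp_mask & ~MODEL_GFP_COMP) == MODEL_GFP_DEMOTION;
}

/*
 * Models the mask selection for a migration target: THP targets normally
 * get the TRANSHUGE bits ORed in, but a demotion mask is kept as it is
 * (plus COMP) so that direct reclaim is never requested.
 */
gfp_t model_target_gfp(gfp_t gfp_mask, bool is_thp)
{
	if (!is_thp)
		return gfp_mask;

	if (model_is_demotion_mask(gfp_mask)) {
		gfp_mask |= MODEL_GFP_COMP;
		/* Demotion must not end up with direct reclaim set. */
		assert(!(gfp_mask & MODEL_GFP_DIRECT_RECLAIM));
		return gfp_mask;
	}

	return gfp_mask | MODEL_GFP_TRANSHUGE;
}
```

In this model, a THP request with the demotion mask comes back without the
direct-reclaim bit, while any other THP request still picks it up from the
TRANSHUGE stand-in.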