On Thu, Oct 27, 2022 at 01:13:30PM +0800, Huang, Ying wrote:
> Feng Tang <feng.tang@xxxxxxxxx> writes:
>
> > In the page reclaim path, memory can be demoted from a faster memory
> > tier to a slower memory tier. Currently there is no check of the
> > cpuset's memory policy, so even if the target demotion node is not
> > allowed by the cpuset, the demotion will still happen, which breaks
> > the cpuset semantics.
> >
> > So add a cpuset policy check in the demotion path and skip demotion
> > if the demotion targets are not allowed by the cpuset.
> >
> > Signed-off-by: Feng Tang <feng.tang@xxxxxxxxx>
> > ---
[...]
> > index 18f6497994ec..c205d98283bc 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1537,9 +1537,21 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
> >  {
> >  	struct page *target_page;
> >  	nodemask_t *allowed_mask;
> > -	struct migration_target_control *mtc;
> > +	struct migration_target_control *mtc = (void *)private;
> >
> > -	mtc = (struct migration_target_control *)private;
>
> I think we should avoid the (void *) conversion here.

OK, will change back.

> > +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
> > +	struct mem_cgroup *memcg;
> > +	nodemask_t cpuset_nmask;
> > +
> > +	memcg = page_memcg(page);
> > +	cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &cpuset_nmask);
> > +
> > +	if (!node_isset(mtc->nid, cpuset_nmask)) {
> > +		if (mtc->nmask)
> > +			nodes_and(*mtc->nmask, *mtc->nmask, cpuset_nmask);
> > +		return alloc_migration_target(page, (unsigned long)mtc);
> > +	}
>
> If node_isset(mtc->nid, cpuset_nmask) == true, we should keep the
> original 2-step allocation and apply nodes_and() on the node mask.

Good catch! Yes, the nodes_and() call should be taken out of this
check and done before calling node_isset().
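Something like the following untested sketch, assuming
cpuset_get_allowed_mem_nodes() keeps the signature used in this patch
(the NULL check on memcg is an extra precaution this version doesn't
have yet):

#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
	struct mem_cgroup *memcg;
	nodemask_t cpuset_nmask;

	memcg = page_memcg(page);
	/* Without a memcg there is no cpuset to consult. */
	if (memcg) {
		cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &cpuset_nmask);

		/* Always restrict the fallback mask by the cpuset. */
		if (mtc->nmask)
			nodes_and(*mtc->nmask, *mtc->nmask, cpuset_nmask);

		/*
		 * If the preferred demotion node is not allowed, skip
		 * the __GFP_THISNODE attempt and allocate directly from
		 * the remaining allowed nodes.
		 */
		if (!node_isset(mtc->nid, cpuset_nmask))
			return alloc_migration_target(page, (unsigned long)mtc);
	}
#endif

	/* otherwise fall through to the original 2-step allocation */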
> > +#endif
> >
> >  	allowed_mask = mtc->nmask;
> >  	/*
> > @@ -1649,6 +1661,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >  		enum folio_references references = FOLIOREF_RECLAIM;
> >  		bool dirty, writeback;
> >  		unsigned int nr_pages;
> > +		bool skip_this_demotion = false;
> >
> >  		cond_resched();
> >
> > @@ -1658,6 +1671,22 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >  		if (!folio_trylock(folio))
> >  			goto keep;
> >
> > +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
> > +		if (do_demote_pass) {
> > +			struct mem_cgroup *memcg;
> > +			nodemask_t nmask, nmask1;
> > +
> > +			node_get_allowed_targets(pgdat, &nmask);
>
> pgdat will not change in the loop, so we can move this out of the loop?

Yes

> > +			memcg = folio_memcg(folio);
> > +			if (memcg)
> > +				cpuset_get_allowed_mem_nodes(memcg->css.cgroup,
> > +							     &nmask1);
> > +
> > +			if (!nodes_intersects(nmask, nmask1))
> > +				skip_this_demotion = true;
> > +		}
>
> If nodes_intersects() == true, we will call
> cpuset_get_allowed_mem_nodes() twice.  Better to pass the intersecting
> mask to demote_folio_list()?

The pages in the loop may come from different memory control groups,
whose cpusets' nodemasks can differ, so I don't know how to save this
per-page info to be used later in demote_folio_list().

Thanks,
Feng

> > +#endif
> > +
> >  		VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
> >
> >  		nr_pages = folio_nr_pages(folio);
> > @@ -1799,7 +1828,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >  		 * Before reclaiming the folio, try to relocate
> >  		 * its contents to another node.
> >  		 */
> > -		if (do_demote_pass &&
> > +		if (do_demote_pass && !skip_this_demotion &&
> >  		    (thp_migration_supported() || !folio_test_large(folio))) {
> >  			list_add(&folio->lru, &demote_folios);
> >  			folio_unlock(folio);
>
> Best Regards,
> Huang, Ying
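For the loop hoisting above, a rough, untested sketch of what I have in
mind ("demote_nmask" is a tentative name); it also initializes the
cpuset mask so nodes_intersects() never reads an uninitialized nodemask
when folio_memcg() returns NULL:

#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
	nodemask_t demote_nmask;

	/* pgdat is fixed for the whole loop, so compute this only once. */
	node_get_allowed_targets(pgdat, &demote_nmask);
#endif

	while (!list_empty(folio_list)) {
		...
#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
		if (do_demote_pass) {
			struct mem_cgroup *memcg = folio_memcg(folio);
			nodemask_t cpuset_nmask;

			/*
			 * Start from "all nodes" so the intersection test
			 * below is well defined when there is no memcg.
			 */
			nodes_setall(cpuset_nmask);
			if (memcg)
				cpuset_get_allowed_mem_nodes(memcg->css.cgroup,
							     &cpuset_nmask);

			if (!nodes_intersects(demote_nmask, cpuset_nmask))
				skip_this_demotion = true;
		}
#endif
		...
	}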