Feng Tang <feng.tang@xxxxxxxxx> writes:

> On Thu, Oct 27, 2022 at 01:13:30PM +0800, Huang, Ying wrote:
>> Feng Tang <feng.tang@xxxxxxxxx> writes:
>>
>> > In the page reclaim path, memory could be demoted from a faster
>> > memory tier to a slower memory tier.  Currently, there is no check
>> > against the cpuset's memory policy, so even if the target demotion
>> > node is not allowed by the cpuset, the demotion will still happen,
>> > which breaks the cpuset semantics.
>> >
>> > So add a cpuset policy check in the demotion path and skip demotion
>> > if the demotion targets are not allowed by the cpuset.
>> >
>> > Signed-off-by: Feng Tang <feng.tang@xxxxxxxxx>
>> > ---
> [...]
>> > index 18f6497994ec..c205d98283bc 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -1537,9 +1537,21 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
>> >  {
>> >  	struct page *target_page;
>> >  	nodemask_t *allowed_mask;
>> > -	struct migration_target_control *mtc;
>> > +	struct migration_target_control *mtc = (void *)private;
>> >
>> > -	mtc = (struct migration_target_control *)private;
>>
>> I think we should avoid the (void *) conversion here.
>
> OK, will change it back.
>
>> > +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
>> > +	struct mem_cgroup *memcg;
>> > +	nodemask_t cpuset_nmask;
>> > +
>> > +	memcg = page_memcg(page);
>> > +	cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &cpuset_nmask);
>> > +
>> > +	if (!node_isset(mtc->nid, cpuset_nmask)) {
>> > +		if (mtc->nmask)
>> > +			nodes_and(*mtc->nmask, *mtc->nmask, cpuset_nmask);
>> > +		return alloc_migration_target(page, (unsigned long)mtc);
>> > +	}
>>
>> If node_isset(mtc->nid, cpuset_nmask) == true, we should keep the
>> original 2-step allocation and still apply nodes_and() on the node
>> mask.
>
> Good catch!  Yes, the nodes_and() call should be taken out of this
> check and done before calling node_isset().
>
>> > +#endif
>> >
>> >  	allowed_mask = mtc->nmask;
>> >  	/*
>> > @@ -1649,6 +1661,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>> >  		enum folio_references references = FOLIOREF_RECLAIM;
>> >  		bool dirty, writeback;
>> >  		unsigned int nr_pages;
>> > +		bool skip_this_demotion = false;
>> >
>> >  		cond_resched();
>> >
>> > @@ -1658,6 +1671,22 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>> >  		if (!folio_trylock(folio))
>> >  			goto keep;
>> >
>> > +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
>> > +		if (do_demote_pass) {
>> > +			struct mem_cgroup *memcg;
>> > +			nodemask_t nmask, nmask1;
>> > +
>> > +			node_get_allowed_targets(pgdat, &nmask);
>>
>> pgdat will not change in the loop, so we can move this out of the
>> loop?
>
> Yes.
>
>> > +			memcg = folio_memcg(folio);
>> > +			if (memcg)
>> > +				cpuset_get_allowed_mem_nodes(memcg->css.cgroup,
>> > +							     &nmask1);
>> > +
>> > +			if (!nodes_intersects(nmask, nmask1))
>> > +				skip_this_demotion = true;
>> > +		}
>>
>> If nodes_intersects() == true, we will call
>> cpuset_get_allowed_mem_nodes() twice.  Better to pass the intersecting
>> mask to demote_folio_list()?
>
> The pages in the loop may come from different memory cgroups, and the
> cpusets' nodemasks could be different; I don't know how to save this
> per-page info to be used later in demote_folio_list().

Yes.  You are right.  We cannot do that.
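Going back to the alloc_demote_page() part above, just to illustrate
what I mean: an untested sketch on top of your v1 hunk could look like
the following.  Note that the page_memcg() NULL check is my own
addition, not something the posted patch had.

static struct page *alloc_demote_page(struct page *page, unsigned long private)
{
	struct page *target_page;
	nodemask_t *allowed_mask;
	struct migration_target_control *mtc;

	mtc = (struct migration_target_control *)private;

#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
	struct mem_cgroup *memcg = page_memcg(page);
	nodemask_t cpuset_nmask;

	/* page_memcg() may return NULL, e.g. for uncharged pages */
	if (memcg) {
		cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &cpuset_nmask);

		/* Always constrain the node mask with the cpuset nodes ... */
		if (mtc->nmask)
			nodes_and(*mtc->nmask, *mtc->nmask, cpuset_nmask);

		/*
		 * ... but fall back to one-step allocation only when the
		 * preferred target node itself is not cpuset-allowed.
		 */
		if (!node_isset(mtc->nid, cpuset_nmask))
			return alloc_migration_target(page, (unsigned long)mtc);
	}
#endif

	allowed_mask = mtc->nmask;
	/* ... the original two-step allocation continues unchanged ... */
}

That keeps the two-step allocation whenever the preferred node is
allowed by the cpuset, while still constraining the fallback mask.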
Best Regards,
Huang, Ying

>
>> > +#endif
>> > +
>> >  		VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
>> >
>> >  		nr_pages = folio_nr_pages(folio);
>> > @@ -1799,7 +1828,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>> >  		 * Before reclaiming the folio, try to relocate
>> >  		 * its contents to another node.
>> >  		 */
>> > -		if (do_demote_pass &&
>> > +		if (do_demote_pass && !skip_this_demotion &&
>> >  		    (thp_migration_supported() || !folio_test_large(folio))) {
>> >  			list_add(&folio->lru, &demote_folios);
>> >  			folio_unlock(folio);
>>
>> Best Regards,
>> Huang, Ying
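P.S. For completeness, a rough, untested sketch of how
node_get_allowed_targets() could be hoisted out of the folio loop, per
the comment above.  The local name "demote_targets" is made up for
illustration, and the memcg guard around nodes_intersects() is my
addition (in the posted hunk, nmask1 could be tested while
uninitialized when the folio has no memcg):

	nodemask_t demote_targets;

	/* pgdat is loop-invariant: compute the allowed targets once */
	node_get_allowed_targets(pgdat, &demote_targets);

	while (!list_empty(folio_list)) {
		/* ... */
#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
		if (do_demote_pass) {
			struct mem_cgroup *memcg = folio_memcg(folio);
			nodemask_t cpuset_nmask;

			/*
			 * Only consult the cpuset when the folio is
			 * charged to a memcg, so we never test against
			 * an uninitialized nodemask.
			 */
			if (memcg) {
				cpuset_get_allowed_mem_nodes(memcg->css.cgroup,
							     &cpuset_nmask);
				if (!nodes_intersects(demote_targets,
						      cpuset_nmask))
					skip_this_demotion = true;
			}
		}
#endif
		/* ... */
	}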