On Wed, Oct 26, 2022 at 10:36:32PM +0800, Waiman Long wrote:
> On 10/26/22 03:43, Feng Tang wrote:
> > In the page reclaim path, memory can be demoted from a faster memory
> > tier to a slower memory tier. Currently, there is no check against the
> > cpuset's memory policy, so even if the target demotion node is not
> > allowed by the cpuset, the demotion will still happen, which breaks
> > the cpuset semantics.
> >
> > So add a cpuset policy check in the demotion path and skip demotion
> > if the demotion targets are not allowed by the cpuset.
> >
> > Signed-off-by: Feng Tang <feng.tang@xxxxxxxxx>
> > ---
> > Hi reviewers,
> >
> > To keep it easily bisectable, I combined the cpuset change and the mm
> > change in one patch. If you prefer to separate them, I can turn it
> > into 2 patches.
> >
> > Thanks,
> > Feng
> >
> >  include/linux/cpuset.h |  6 ++++++
> >  kernel/cgroup/cpuset.c | 29 +++++++++++++++++++++++++++++
> >  mm/vmscan.c            | 35 ++++++++++++++++++++++++++++++++---
> >  3 files changed, 67 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> > index d58e0476ee8e..6fcce2bd2631 100644
> > --- a/include/linux/cpuset.h
> > +++ b/include/linux/cpuset.h
> > @@ -178,6 +178,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
> >  	task_unlock(current);
> >  }
> >
> > +extern void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
> > +					 nodemask_t *nmask);
> >  #else /* !CONFIG_CPUSETS */
> >
> >  static inline bool cpusets_enabled(void) { return false; }
> > @@ -299,6 +301,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
> >  	return false;
> >  }
> >
> > +static inline void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
> > +						nodemask_t *nmask)
> > +{
> > +}
> >  #endif /* !CONFIG_CPUSETS */
> >
> >  #endif /* _LINUX_CPUSET_H */
> > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > index 3ea2e836e93e..cbb118c0502f 100644
> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -3750,6 +3750,35 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
> >  	return mask;
> >  }
> >
> > +/*
> > + * Retrieve the allowed memory nodemask for a cgroup.
> > + *
> > + * Set *nmask to cpuset's effective allowed nodemask for cgroup v2,
> > + * and NODE_MASK_ALL (means no constraint) for cgroup v1 where there
> > + * is no guaranteed association from a cgroup to a cpuset.
> > + */
> > +void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, nodemask_t *nmask)
> > +{
> > +	struct cgroup_subsys_state *css;
> > +	struct cpuset *cs;
> > +
> > +	if (!is_in_v2_mode()) {
> > +		*nmask = NODE_MASK_ALL;
> > +		return;
> > +	}
>
> You are allowing all nodes to be used for cgroup v1. Is there a reason
> why you ignore v1?

The use case for this API is: for a memory control group, the caller
wants to get the memory policy of its associated cpuset controller, so
it walks the memcg --> cgroup --> cpuset chain (a rough sketch of this
chain, for illustration only, is appended at the end of this mail).
IIUC, there is no reliable chain like this for cgroup v1, and since
cgroup v2 is the default option for many distros, cgroup v1 is bypassed
here.

> > +
> > +	rcu_read_lock();
> > +	css = cgroup_e_css(cgroup, &cpuset_cgrp_subsys);
> > +	if (css) {
> > +		css_get(css);
> > +		cs = css_cs(css);
> > +		*nmask = cs->effective_mems;
> > +		css_put(css);
> > +	}
>
> Since you are holding an RCU read lock and copying out the whole
> nodemask, you probably don't need to do a css_get/css_put pair.

Thanks for the note!

Thanks,
Feng

> > +
> > +	rcu_read_unlock();
> > +}
> > +
>
> Cheers,
>
> Longman
>
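
For reference, the reclaim-side idea is roughly like the sketch below.
This is a simplified, untested illustration of the memcg --> cgroup -->
cpuset chain mentioned above, not the actual mm/vmscan.c hunk of the
patch (which is not quoted in this mail); demotion_allowed_by_cpuset()
and its arguments are made-up names for this example, and only
cpuset_get_allowed_mem_nodes() comes from the patch:

#include <linux/cpuset.h>
#include <linux/memcontrol.h>
#include <linux/nodemask.h>

/*
 * Hypothetical helper: would the cpuset associated with @memcg allow
 * demoting pages to @target_nid?
 */
static bool demotion_allowed_by_cpuset(struct mem_cgroup *memcg, int target_nid)
{
	nodemask_t allowed = NODE_MASK_ALL;

	/* No memcg means no cpuset constraint to honor */
	if (!memcg)
		return true;

	/* memcg --> cgroup --> cpuset: copy out the effective mems */
	cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &allowed);

	/* Demotion is skipped when the target node is outside the mask */
	return node_isset(target_nid, allowed);
}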