[CC cpuset people] On Tue 04-04-23 19:55:09, Gang Li wrote: > When a process in cpuset triggers oom, it may kill a completely > irrelevant process on another numa node, which will not release any > memory for this cpuset. > > It seems that `CONSTRAINT_CPUSET` is not really doing much these > days. Using CONSTRAINT_CPUSET, we can easily achieve node aware oom > by selecting victim from all cpusets with the same mems_allowed as > the current cpuset. This should go into more details about the usecase, testing and ideally also spend couple of words about how CONSTRAINT_CPUSET is actually implemented because this is not really immediately obvious. An example of before/after behavior would have been really nice as well. You should also go into much more details about how oom victims are actually evaluated. > Suggested-by: Michal Hocko <mhocko@xxxxxxxx> > Signed-off-by: Gang Li <ligang.bdlg@xxxxxxxxxxxxx> > --- > Changes in v2: > - Select victim from all cpusets with the same mems_allowed as the current cpuset. > (David Rientjes <rientjes@xxxxxxxxxx>) > > v1: > - https://lore.kernel.org/all/20220921064710.89663-1-ligang.bdlg@xxxxxxxxxxxxx/ > > --- > include/linux/cpuset.h | 6 ++++++ > kernel/cgroup/cpuset.c | 28 ++++++++++++++++++++++++++++ > mm/oom_kill.c | 4 ++++ > 3 files changed, 38 insertions(+) As this is a userspace visible change it should also be documented somewhere in Documentation. I am not really familiar with cpusets internals so I cannot really judge cpuset_cgroup_scan_tasks implementation. The oom report should be explicit about this being CPUSET specific oom handling so unexpected behavior could be nailed down to this change so I do not see a major concern from the oom POV. Nevertheless it would be still good to consider whether this should be an opt-in behavior. I personally do not see a major problem because most cpuset deployments I have seen tend to be well partitioned so the new behavior makes more sense. > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h > index 980b76a1237e..fc244141bd52 100644 > --- a/include/linux/cpuset.h > +++ b/include/linux/cpuset.h > @@ -171,6 +171,8 @@ static inline void set_mems_allowed(nodemask_t nodemask) > task_unlock(current); > } > > +int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg); > + > #else /* !CONFIG_CPUSETS */ > > static inline bool cpusets_enabled(void) { return false; } > @@ -287,6 +289,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq) > return false; > } > > +static inline int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg) > +{ > + return 0; > +} > #endif /* !CONFIG_CPUSETS */ > > #endif /* _LINUX_CPUSET_H */ > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c > index bc4dcfd7bee5..b009c98ca19e 100644 > --- a/kernel/cgroup/cpuset.c > +++ b/kernel/cgroup/cpuset.c > @@ -4013,6 +4013,34 @@ void cpuset_print_current_mems_allowed(void) > rcu_read_unlock(); > } > > +int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg) > +{ > + int ret = 0; > + struct css_task_iter it; > + struct task_struct *task; > + struct cpuset *cs; > + struct cgroup_subsys_state *pos_css; > + > + /* > + * Situation gets complex with overlapping nodemasks in different cpusets. > + * TODO: Maybe we should calculate the "distance" between different mems_allowed. > + * > + * But for now, let's make it simple. Just iterate through all cpusets > + * with the same mems_allowed as the current cpuset. > + */ > + rcu_read_lock(); > + cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) { > + if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) { > + css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it); > + while (!ret && (task = css_task_iter_next(&it))) > + ret = fn(task, arg); > + css_task_iter_end(&it); > + } > + } > + rcu_read_unlock(); > + return ret; > +} > + > /* > * Collection of memory_pressure is suppressed unless > * this flag is enabled by writing "1" to the special > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 044e1eed720e..205982a69b30 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -367,6 +367,8 @@ static void select_bad_process(struct oom_control *oc) > > if (is_memcg_oom(oc)) > mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc); > + else if (oc->constraint == CONSTRAINT_CPUSET) > + cpuset_cgroup_scan_tasks(oom_evaluate_task, oc); > else { > struct task_struct *p; > > @@ -427,6 +429,8 @@ static void dump_tasks(struct oom_control *oc) > > if (is_memcg_oom(oc)) > mem_cgroup_scan_tasks(oc->memcg, dump_task, oc); > + else if (oc->constraint == CONSTRAINT_CPUSET) > + cpuset_cgroup_scan_tasks(dump_task, oc); > else { > struct task_struct *p; > > -- > 2.20.1 -- Michal Hocko SUSE Labs