It was found that any change to the current cpuset hierarchy may reset
the cpus_allowed list of the tasks in the affected cpusets to the
default cpuset value, even if those tasks had their CPU affinity
explicitly set by the user beforehand. This is especially easy to
trigger in a cgroup v2 environment, where writing "+cpuset" to the root
cgroup's cgroup.subtree_control file resets the CPU affinity of every
process in the system. It is particularly problematic in a nohz_full
environment, where tasks running on the nohz_full CPUs usually have
their CPU affinity explicitly set and will misbehave if that affinity
changes.

Fix this problem by adding a flag to the task structure to indicate
that a task has had its CPU affinity explicitly set, and make the
cpuset code leave such a task's cpus_allowed list alone unless the
user-chosen cpu list is no longer a subset of the cpus_allowed list of
the cpuset itself.

With this change in place, it was verified that tasks whose CPU
affinity had been explicitly set are no longer affected by changes made
to the v2 cgroup.subtree_control files.

Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
---
 include/linux/sched.h  |  1 +
 kernel/cgroup/cpuset.c | 18 ++++++++++++++++--
 kernel/sched/core.c    |  1 +
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c46f3a63b758..60ae022fa842 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -815,6 +815,7 @@ struct task_struct {
 
 	unsigned int			policy;
 	int				nr_cpus_allowed;
+	int				cpus_affinity_set;
 	const cpumask_t			*cpus_ptr;
 	cpumask_t			*user_cpus_ptr;
 	cpumask_t			cpus_mask;
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 71a418858a5e..c47757c61f39 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -704,6 +704,20 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
 	return ret;
 }
 
+/*
+ * Don't change the cpus_allowed list if CPU affinity has been explicitly
+ * set before and the current cpu list is still a subset of the new one.
+ */
+static int cpuset_set_cpus_allowed_ptr(struct task_struct *p,
+				       const struct cpumask *new_mask)
+{
+	if (p->cpus_affinity_set && cpumask_subset(p->cpus_ptr, new_mask))
+		return 0;
+
+	p->cpus_affinity_set = 0;
+	return set_cpus_allowed_ptr(p, new_mask);
+}
+
 #ifdef CONFIG_SMP
 /*
  * Helper routine for generate_sched_domains().
@@ -1130,7 +1144,7 @@ static void update_tasks_cpumask(struct cpuset *cs)
 
 	css_task_iter_start(&cs->css, 0, &it);
 	while ((task = css_task_iter_next(&it)))
-		set_cpus_allowed_ptr(task, cs->effective_cpus);
+		cpuset_set_cpus_allowed_ptr(task, cs->effective_cpus);
 	css_task_iter_end(&it);
 }
 
@@ -2303,7 +2317,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 		 * can_attach beforehand should guarantee that this doesn't
 		 * fail. TODO: have a better way to handle failure here
 		 */
-		WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
+		WARN_ON_ONCE(cpuset_set_cpus_allowed_ptr(task, cpus_attach));
 
 		cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to);
 		cpuset_update_task_spread_flag(cs, task);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da0bf6fe9ecd..ab8ea6fa92db 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8034,6 +8034,7 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
 	if (retval)
 		goto out_free_new_mask;
 
+	p->cpus_affinity_set = 1;
 	cpuset_cpus_allowed(p, cpus_allowed);
 	if (!cpumask_subset(new_mask, cpus_allowed)) {
 		/*
-- 
2.31.1
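
For reference, a minimal sketch of how the reset can be exercised,
assuming cgroup v2 is mounted at /sys/fs/cgroup and using the current
shell as the affined task (the CPU number is arbitrary):

  # Explicitly set the shell's CPU affinity, enable the cpuset
  # controller for the root cgroup's children, then re-check the
  # shell's affinity.
  taskset -pc 3 $$
  echo "+cpuset" > /sys/fs/cgroup/cgroup.subtree_control
  taskset -pc $$

Without this patch, the last command reports the affinity reset to the
cpuset's full CPU list; with it, the explicitly set affinity is
retained as long as it remains a subset of the cpuset's effective CPUs.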