__set_cpus_allowed_ptr() migrates running or runnable tasks, setting
the task's CPU accordingly. The CPU is not set when the task is not
runnable, due to complications in the hotplug code; the task's CPU
will be updated on the next wakeup anyway. However, this creates a
problem for users of task_cpu(p): it might point to a CPU on which
the task cannot run or, worse, to a runqueue/root_domain the task
does not belong to, causing odd errors. For example, the script below
shows that a sleeping task cannot become SCHED_DEADLINE if it was
moved to another (exclusive) cpuset:

----- %< -----
#!/bin/bash
# Enter the cgroup directory
cd /sys/fs/cgroup/

# Check if it is cgroup v2 and enable cpuset
if [ -e cgroup.subtree_control ]; then
	# Enable the cpuset controller on cgroup v2
	echo +cpuset > cgroup.subtree_control
fi

echo LOG: create an exclusive cpuset and assign CPU 0 to it

# Create the cpuset group
rmdir dl-group &> /dev/null
mkdir dl-group

# Restrict the group to CPU 0
echo 0 > dl-group/cpuset.mems
echo 0 > dl-group/cpuset.cpus
echo root > dl-group/cpuset.cpus.partition

echo LOG: dispatching a regular task
sleep 100 &
CPUSET_PID="$!"

# let it settle down
sleep 1

# Assign the task to the cgroup
echo LOG: moving the task to the cpuset
echo "$CPUSET_PID" > dl-group/cgroup.procs 2> /dev/null

chrt -p -d --sched-period 1000000000 --sched-runtime 100000000 0 $CPUSET_PID
ACCEPTED=$?

if [ $ACCEPTED == 0 ]; then
	echo PASS: the task became DL
else
	echo FAIL: the task was rejected as DL
fi

# Just ignore the cleanup
exec > /dev/null 2>&1
kill -9 $CPUSET_PID
rmdir dl-group
----- >% -----

Long story short: the sleep task went to sleep (so it is not
runnable) on a CPU != 0. After it is moved to a cpuset containing
only CPU 0, task_cpu() returns a CPU that does not belong to the
cpuset the task is in, and the task is rejected in this if:

----- %< -----
__sched_setscheduler():
[...]
	rq = task_rq_lock(p, &rf);	<-- uses task_cpu(), which points
					<-- to the old CPU.
[...]
	if (dl_bandwidth_enabled() && dl_policy(policy) &&
			!(attr->sched_flags & SCHED_FLAG_SUGOV)) {
		cpumask_t *span = rq->rd->span;	<-- wrong rd!

		/*
		 * Don't allow tasks with an affinity mask smaller than
		 * the entire root_domain to become SCHED_DEADLINE. We
		 * will also fail if there's no bandwidth available.
		 */
		if (!cpumask_subset(span, p->cpus_ptr) ||
		    rq->rd->dl_bw.bw == 0) {
			retval = -EPERM;	<-- returns here.
			goto unlock;
		}
	}
----- >% -----

The rq, and so the root domain, correspond to the CPU on which the
sleep command went to... sleep, not to the CPU on which it will run
at the next wakeup, given its affinity.

To avoid this problem, use the dl_task*() helpers, which return the
task's CPU, root domain, and "root" dl_bw while taking the state of
task->cpu into account.
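For reference, a minimal sketch of what such helpers could look like.
The actual dl_task_cpu()/dl_task_rd()/dl_task_root_bw()
implementations are introduced earlier in this series, so treat this
only as an illustration of the idea (fall back to the affinity mask
when the task is not queued), not as the real code:

----- %< -----
/*
 * Illustrative sketch only; the real helpers live in an earlier
 * patch. If the task is not queued, task_cpu() may be stale, so
 * derive a valid CPU from the affinity mask instead.
 */
static inline unsigned int dl_task_cpu(struct task_struct *p)
{
	if (task_on_rq_queued(p))
		return task_cpu(p);

	return cpumask_any(p->cpus_ptr);
}

/* Root domain of the CPU the task will actually run on. */
static inline struct root_domain *dl_task_rd(struct task_struct *p)
{
	return cpu_rq(dl_task_cpu(p))->rd;
}

/* The "root" dl_bw of that root domain. */
static inline struct dl_bw *dl_task_root_bw(struct task_struct *p)
{
	return &dl_task_rd(p)->dl_bw;
}
----- >% -----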
Reported-by: Marco Perronet <perronet@xxxxxxxxxxx>
Signed-off-by: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
Cc: Ben Segall <bsegall@xxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
Cc: Li Zefan <lizefan@xxxxxxxxxx>
Cc: Tejun Heo <tj@xxxxxxxxxx>
Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: Valentin Schneider <valentin.schneider@xxxxxxx>
Cc: linux-kernel@xxxxxxxxxxxxxxx
Cc: cgroups@xxxxxxxxxxxxxxx
---
 kernel/sched/core.c     | 6 +++---
 kernel/sched/deadline.c | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5961a97541c2..3c2775e6869f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5905,15 +5905,15 @@ static int __sched_setscheduler(struct task_struct *p,
 #ifdef CONFIG_SMP
 	if (dl_bandwidth_enabled() && dl_policy(policy) &&
 			!(attr->sched_flags & SCHED_FLAG_SUGOV)) {
-		cpumask_t *span = rq->rd->span;
+		struct root_domain *rd = dl_task_rd(p);
 
 		/*
 		 * Don't allow tasks with an affinity mask smaller than
 		 * the entire root_domain to become SCHED_DEADLINE. We
 		 * will also fail if there's no bandwidth available.
 		 */
-		if (!cpumask_subset(span, p->cpus_ptr) ||
-		    rq->rd->dl_bw.bw == 0) {
+		if (!cpumask_subset(rd->span, p->cpus_ptr) ||
+		    rd->dl_bw.bw == 0) {
 			retval = -EPERM;
 			goto unlock;
 		}

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c221e14d5b86..1f6264cb8867 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2678,8 +2678,8 @@ int sched_dl_overflow(struct task_struct *p, int policy,
 	u64 period = attr->sched_period ?: attr->sched_deadline;
 	u64 runtime = attr->sched_runtime;
 	u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
-	int cpus, err = -1, cpu = task_cpu(p);
-	struct dl_bw *dl_b = dl_bw_of(cpu);
+	int cpus, err = -1, cpu = dl_task_cpu(p);
+	struct dl_bw *dl_b = dl_task_root_bw(p);
 	unsigned long cap;
 
 	if (attr->sched_flags & SCHED_FLAG_SUGOV)
-- 
2.29.2