__set_cpus_allowed_ptr() migrates running or runnable tasks, setting
the task's CPU accordingly. The CPU is not set when the task is not
runnable, due to complications in the hotplug code; the task's CPU
will be updated on the next wakeup anyway. However, this creates a
problem for users of task_cpu(p): it might point to a CPU on which
the task cannot run or, worse, to a runqueue/root_domain the task
does not belong to, causing odd errors. For example, the script below
shows that a sleeping task cannot become SCHED_DEADLINE if it was
moved to another (exclusive) cpuset:

----- %< -----
#!/bin/bash
# Enter the cgroup directory
cd /sys/fs/cgroup/

# Check if it is cgroup v2 and enable cpuset
if [ -e cgroup.subtree_control ]; then
	# Enable the cpuset controller on cgroup v2
	echo +cpuset > cgroup.subtree_control
fi

echo LOG: create an exclusive cpuset and assign CPU 0 to it

# Create the cpuset group
rmdir dl-group &> /dev/null
mkdir dl-group

# Restrict the group to CPU 0
echo 0 > dl-group/cpuset.mems
echo 0 > dl-group/cpuset.cpus
echo root > dl-group/cpuset.cpus.partition

echo LOG: dispatching a regular task
sleep 100 &
CPUSET_PID="$!"

# let it settle down
sleep 1

# Assign the task to the cgroup
echo LOG: moving the task to the cpuset
echo "$CPUSET_PID" > dl-group/cgroup.procs 2> /dev/null

chrt -p -d --sched-period 1000000000 --sched-runtime 100000000 0 $CPUSET_PID
ACCEPTED=$?

if [ $ACCEPTED == 0 ]; then
	echo PASS: the task became DL
else
	echo FAIL: the task was rejected as DL
fi

# Just ignore the cleanup
exec > /dev/null 2>&1
kill -9 $CPUSET_PID
rmdir dl-group
----- >% -----

Long story short: the sleep task went to sleep (so it is not
runnable) on a CPU != 0. After it is moved to a cpuset containing
only CPU 0, task_cpu() returns a CPU that does not belong to the
cpuset the task is in, and the task is rejected in this if:

----- %< -----
__sched_setscheduler():
[...]
	rq = task_rq_lock(p, &rf);	<-- uses task_cpu(), which points
					<-- to the old CPU.
[...]
	if (dl_bandwidth_enabled() && dl_policy(policy) &&
			!(attr->sched_flags & SCHED_FLAG_SUGOV)) {
		cpumask_t *span = rq->rd->span;	<-- wrong rd!

		/*
		 * Don't allow tasks with an affinity mask smaller than
		 * the entire root_domain to become SCHED_DEADLINE. We
		 * will also fail if there's no bandwidth available.
		 */
		if (!cpumask_subset(span, p->cpus_ptr) ||
		    rq->rd->dl_bw.bw == 0) {
			retval = -EPERM;	<-- returns here.
			goto unlock;
		}
	}
----- >% -----

The rq, and so the root domain, correspond to the CPU on which the
sleep command went to... sleep, not to the CPU on which it will run
at the next wakeup, given its affinity.

To avoid this problem, use the dl_task*() helpers, which return the
task's CPU, root domain, and "root" dl_bw while taking the state of
task->cpu into account.
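For reference, a minimal sketch of what such helpers could look like.
The actual dl_task_cpu()/dl_task_rd()/dl_task_root_bw()
implementations are introduced earlier in this series, so treat this
only as an illustration of the idea (fall back to the affinity mask
when the task is not queued), not as the real code:

----- %< -----
/*
 * Illustrative sketch only; the real helpers live in an earlier
 * patch. If the task is not queued, task_cpu() may be stale, so
 * derive a valid CPU from the affinity mask instead.
 */
static inline unsigned int dl_task_cpu(struct task_struct *p)
{
	if (task_on_rq_queued(p))
		return task_cpu(p);

	return cpumask_any(p->cpus_ptr);
}

/* Root domain of the CPU the task will actually run on. */
static inline struct root_domain *dl_task_rd(struct task_struct *p)
{
	return cpu_rq(dl_task_cpu(p))->rd;
}

/* The "root" dl_bw of that root domain. */
static inline struct dl_bw *dl_task_root_bw(struct task_struct *p)
{
	return &dl_task_rd(p)->dl_bw;
}
----- >% -----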
Reported-by: Marco Perronet <perronet@xxxxxxxxxxx>
Signed-off-by: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
Cc: Ben Segall <bsegall@xxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
Cc: Li Zefan <lizefan@xxxxxxxxxx>
Cc: Tejun Heo <tj@xxxxxxxxxx>
Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: Valentin Schneider <valentin.schneider@xxxxxxx>
Cc: linux-kernel@xxxxxxxxxxxxxxx
Cc: cgroups@xxxxxxxxxxxxxxx
---
 kernel/sched/core.c     | 6 +++---
 kernel/sched/deadline.c | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5961a97541c2..3c2775e6869f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5905,15 +5905,15 @@ static int __sched_setscheduler(struct task_struct *p,
 #ifdef CONFIG_SMP
 	if (dl_bandwidth_enabled() && dl_policy(policy) &&
 			!(attr->sched_flags & SCHED_FLAG_SUGOV)) {
-		cpumask_t *span = rq->rd->span;
+		struct root_domain *rd = dl_task_rd(p);
 
 		/*
 		 * Don't allow tasks with an affinity mask smaller than
 		 * the entire root_domain to become SCHED_DEADLINE. We
 		 * will also fail if there's no bandwidth available.
 		 */
-		if (!cpumask_subset(span, p->cpus_ptr) ||
-		    rq->rd->dl_bw.bw == 0) {
+		if (!cpumask_subset(rd->span, p->cpus_ptr) ||
+		    rd->dl_bw.bw == 0) {
 			retval = -EPERM;
 			goto unlock;
 		}

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c221e14d5b86..1f6264cb8867 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2678,8 +2678,8 @@ int sched_dl_overflow(struct task_struct *p, int policy,
 	u64 period = attr->sched_period ?: attr->sched_deadline;
 	u64 runtime = attr->sched_runtime;
 	u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
-	int cpus, err = -1, cpu = task_cpu(p);
-	struct dl_bw *dl_b = dl_bw_of(cpu);
+	int cpus, err = -1, cpu = dl_task_cpu(p);
+	struct dl_bw *dl_b = dl_task_root_bw(p);
 	unsigned long cap;
 
 	if (attr->sched_flags & SCHED_FLAG_SUGOV)
-- 
2.29.2