The current SCHED_DEADLINE design supports only global scheduling and
its variants, i.e., clustered and partitioned scheduling, configured
via cpusets. To partition a system into clusters of CPUs, the
documentation advises using exclusive cpusets, each of which creates an
exclusive root_domain. An attempt to change the CPU affinity of a
thread to a CPU mask different from the root domain results in an
error. For instance:

----- %< -----
  [root@x1 linux]# chrt -d --sched-period 1000000000 --sched-runtime 100000000 0 sleep 10000 &
  [1] 69020
  [root@x1 linux]# taskset -p -c 0 69020
  pid 69020's current affinity list: 0-7
  taskset: failed to set pid 69020's affinity: Device or resource busy
----- >% -----

This restriction can be bypassed by disabling the SCHED_DEADLINE
admission test, under the assumption that the user is aware of the
implications of such a decision.

However, Marco Perronet noticed that it was still possible to bypass
this restriction, because the cpuset mechanism currently imposes no
such check. For instance, this script:

----- %< -----
  #!/bin/bash

  # Enter the cgroup directory
  cd /sys/fs/cgroup/

  # Check whether this is cgroup v2 and, if so, enable cpuset
  if [ -e cgroup.subtree_control ]; then
	# Enable the cpuset controller on cgroup v2
	echo +cpuset > cgroup.subtree_control
  fi

  echo LOG: create a cpuset and assigned the CPU 0 to it
  # Create the cpuset group
  rmdir dl-group &> /dev/null
  mkdir dl-group

  # Restrict the group to CPU 0
  echo 0 > dl-group/cpuset.mems
  echo 0 > dl-group/cpuset.cpus

  # Place a task in the root cgroup
  echo LOG: dispatching the first DL task
  chrt -d --sched-period 1000000000 --sched-runtime 100000000 0 sleep 100 &
  ROOT_PID="$!"
  ROOT_ALLOWED=`cat /proc/$ROOT_PID/status | grep Cpus_allowed_list | awk '{print $2}'`

  # Dispatch another task in the root cgroup, to move it later
  echo LOG: dispatching the second DL task
  chrt -d --sched-period 1000000000 --sched-runtime 100000000 0 sleep 100 &
  CPUSET_PID="$!"
  # Let them settle down
  sleep 1

  # Assign the second task to the cgroup
  echo LOG: moving the second DL task to the cpuset
  echo "$CPUSET_PID" > dl-group/cgroup.procs 2> /dev/null
  ACCEPTED=$?
  CPUSET_ALLOWED=`cat /proc/$CPUSET_PID/status | grep Cpus_allowed_list | awk '{print $2}'`

  if [ "$ACCEPTED" == 0 ]; then
	echo FAIL: a DL task was accepted on a non-exclusive cpuset
  else
	echo PASS: DL task was rejected on a non-exclusive cpuset
  fi

  if [ "$ROOT_ALLOWED" == "$CPUSET_ALLOWED" ]; then
	echo PASS: the affinity did not change: $CPUSET_ALLOWED == $ROOT_ALLOWED
  else
	echo FAIL: the cpu affinity is different: $CPUSET_ALLOWED == $ROOT_ALLOWED
  fi

  # Clean up, ignoring any output
  exec > /dev/null 2>&1
  kill -9 $CPUSET_PID
  kill -9 $ROOT_PID
  rmdir dl-group
----- >% -----

produces these results:

----- %< -----
  LOG: create a cpuset and assigned the CPU 0 to it
  LOG: dispatching the first DL task
  LOG: dispatching the second DL task
  LOG: moving the second DL task to the cpuset
  FAIL: a DL task was accepted on a non-exclusive cpuset
  FAIL: the cpu affinity is different: 0 == 0-3
----- >% -----

This result is a problem because the two tasks have different CPU
masks but end up sharing CPU 0, a configuration that the current
SCHED_DEADLINE design does not support (APA - Arbitrary Processor
Affinities).

To avoid this scenario, reject the attachment of a SCHED_DEADLINE
thread to a non-exclusive cpuset.

With the proposed patch in place, the script above returns:

----- %< -----
  LOG: create a cpuset and assigned the CPU 0 to it
  LOG: dispatching the first DL task
  LOG: dispatching the second DL task
  LOG: moving the second DL task to the cpuset
  PASS: DL task was rejected on a non-exclusive cpuset
  PASS: the affinity did not change: 0-3 == 0-3
----- >% -----

Still, as with taskset, users can bypass this restriction by disabling
the admission test, i.e.:

  # sysctl -w kernel.sched_rt_runtime_us=-1

and work at their own risk.
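
As a rough illustration of the sysctl above, the following sketch
reports whether the admission test is currently enabled. It assumes
that a negative value of kernel.sched_rt_runtime_us (such as -1)
disables the test. The function name dl_admission_state and its
optional path argument are hypothetical, introduced only for this
example.

```shell
#!/bin/bash
# Sketch only: dl_admission_state is a hypothetical helper, not part of
# the kernel tree or of util-linux.

dl_admission_state() {
	# Optional argument: a file to read instead of the procfs entry,
	# which makes the helper usable without touching the live system.
	local file="${1:-/proc/sys/kernel/sched_rt_runtime_us}"
	local runtime

	# Give up if the value cannot be read.
	runtime=$(cat -- "$file") || return 1

	# Assumption: a negative runtime (e.g. -1) means the admission
	# test is disabled; any other value means it is enabled.
	if [ "$runtime" -lt 0 ]; then
		echo disabled
	else
		echo enabled
	fi
}
```

Called without arguments, dl_admission_state reads the live sysctl
value, so it could be run before the test script to confirm that the
rejection is expected to apply.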

Reported-by: Marco Perronet <perronet@xxxxxxxxxxx>
Signed-off-by: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
Cc: Ben Segall <bsegall@xxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
Cc: Li Zefan <lizefan@xxxxxxxxxx>
Cc: Tejun Heo <tj@xxxxxxxxxx>
Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: Valentin Schneider <valentin.schneider@xxxxxxx>
Cc: linux-kernel@xxxxxxxxxxxxxxx
Cc: cgroups@xxxxxxxxxxxxxxx
---
 kernel/sched/deadline.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 788a391657a5..c221e14d5b86 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2878,6 +2878,13 @@ int dl_task_can_attach(struct task_struct *p,
 	if (cpumask_empty(cs_cpus_allowed))
 		return 0;
 
+	/*
+	 * Do not allow moving tasks to non-exclusive cpusets
+	 * if bandwidth control is enabled.
+	 */
+	if (dl_bandwidth_enabled() && !exclusive)
+		return -EBUSY;
+
 	/*
 	 * The task is not moving to another root domain, so it is
 	 * already accounted.
-- 
2.29.2