Re: [PATCH 2/2] sched/deadline: Correctly account for allocated bandwidth during hotplug

On 11/13/24 7:57 AM, Juri Lelli wrote:
For hotplug operations, DEADLINE needs to check that there is still enough
bandwidth left after removing the CPU that is going offline. We however
fail to do so currently.

Restore the correct behavior by restructuring dl_bw_manage() a bit, so
that overflow conditions (not enough bandwidth left) are properly
checked. Also account for dl_server bandwidth, i.e. discount it in the
calculation, since NORMAL tasks will be moved away from the CPU anyway
as a result of the hotplug operation.

Signed-off-by: Juri Lelli <juri.lelli@xxxxxxxxxx>
---
  kernel/sched/core.c     |  2 +-
  kernel/sched/deadline.c | 33 ++++++++++++++++++++++++---------
  kernel/sched/sched.h    |  2 +-
  3 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 43e453ab7e20..d1049e784510 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8057,7 +8057,7 @@ static void cpuset_cpu_active(void)
  static int cpuset_cpu_inactive(unsigned int cpu)
  {
  	if (!cpuhp_tasks_frozen) {
-		int ret = dl_bw_check_overflow(cpu);
+		int ret = dl_bw_deactivate(cpu);

  		if (ret)
  			return ret;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e53208a50279..609685c5df05 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -3467,29 +3467,31 @@ int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur,
  }

  enum dl_bw_request {
-	dl_bw_req_check_overflow = 0,
+	dl_bw_req_deactivate = 0,
  	dl_bw_req_alloc,
  	dl_bw_req_free
  };

  static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
  {
-	unsigned long flags;
+	unsigned long flags, cap;
  	struct dl_bw *dl_b;
  	bool overflow = 0;
+	u64 fair_server_bw = 0;

  	rcu_read_lock_sched();
  	dl_b = dl_bw_of(cpu);
  	raw_spin_lock_irqsave(&dl_b->lock, flags);

-	if (req == dl_bw_req_free) {
+	cap = dl_bw_capacity(cpu);
+	switch (req) {
+	case dl_bw_req_free:
  		__dl_sub(dl_b, dl_bw, dl_bw_cpus(cpu));
-	} else {
-		unsigned long cap = dl_bw_capacity(cpu);
-
+		break;
+	case dl_bw_req_alloc:
  		overflow = __dl_overflow(dl_b, cap, 0, dl_bw);
-
-		if (req == dl_bw_req_alloc && !overflow) {
+		if (!overflow) {
  			/*
  			 * We reserve space in the destination
  			 * root_domain, as we can't fail after this point.
@@ -3498,6 +3500,19 @@ static int dl_bw_manage(enum dl_bw_request req, int cpu, u64 dl_bw)
  			 */
  			__dl_add(dl_b, dl_bw, dl_bw_cpus(cpu));
  		}
+		break;
+	case dl_bw_req_deactivate:
+		/*
+		 * cpu is going offline and NORMAL tasks will be moved away
+		 * from it. We can thus discount dl_server bandwidth
+		 * contribution as it won't need to be servicing tasks after
+		 * the cpu is off.
+		 */
+		if (cpu_rq(cpu)->fair_server.dl_server)
+			fair_server_bw = cpu_rq(cpu)->fair_server.dl_bw;
+
+		overflow = __dl_overflow(dl_b, cap, fair_server_bw, 0);
+		break;

This part can still cause a failure in one of the test cases in my cpuset partition test script. In this particular case, the CPU to be offlined is an isolated CPU with scheduling disabled. As a result, total_bw is 0 and the __dl_overflow() test fails. Is there a way to skip the __dl_overflow() test for isolated CPUs? Can we use a null total_bw as a proxy for that?

Thanks,
Longman
